---
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# Random-Llama-Small

## Model Overview

**Random-Llama-Small** is a randomly initialized transformer-based language model with approximately 2 billion parameters, built using the LLaMA architecture. It is designed for research purposes, providing a starting point for pretraining or fine-tuning on custom datasets. The model uses the tokenizer from `HuggingFaceTB/SmolLM2-1.7B-Instruct` and is configured for causal language modeling. As a randomly initialized model, it produces incoherent outputs until trained, making it ideal for researchers studying transformer training dynamics or developing custom language models.

---

## Key Details

- **Architecture:** LLaMA (Causal Language Model)  
- **Parameters:** ~2B 
- **Hidden Size:** 2304  
- **Layers:** 22  
- **Attention Heads:** 36 (with 9 key-value heads for grouped-query attention)  
- **Intermediate Size:** 9216  
- **Vocabulary Size:** 128256  
- **Tokenizer:** Imported from `HuggingFaceTB/SmolLM2-1.7B-Instruct`  
- **Precision:** bfloat16  
- **Max Context Length:** 131,072 tokens (with RoPE scaling)  
- **License:** MIT 

---

## LLaMA Architecture

LLaMA, developed by Meta AI, is a family of efficient transformer-based models originally released for research. Random-Llama-Small follows this architecture, incorporating several key features:

### Core Components

- **Decoder-Only Transformer:** Predicts the next token in a sequence based on prior tokens, suitable for autoregressive tasks like text generation.
- **Grouped-Query Attention (GQA):** 36 attention heads with only 9 key-value heads, improving efficiency and reducing memory/compute cost.
- **Rotary Position Embeddings (RoPE):** Embeds positional information with scaling, enabling a context length of up to 131,072 tokens.
- **SwiGLU Activation:** Uses a SiLU (Swish)-gated feed-forward network for improved expressiveness.
- **RMSNorm:** Root Mean Square Layer Normalization replaces LayerNorm for stability and faster convergence.
- **Tied Embeddings:** Input and output embeddings share weights (`tie_word_embeddings=True`), reducing the parameter count by roughly 295M (see the rough count below).
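
Given the sizes above, the ~2B and ~295M figures can be reproduced with a few lines of arithmetic. This is a back-of-envelope sketch that assumes the standard Llama layer layout in `transformers` (bias-free attention projections, a gated MLP, and RMSNorm weights):

```python
# Back-of-envelope parameter count for the configuration above.
hidden = 2304
layers = 22
heads = 36
kv_heads = 9
intermediate = 9216
vocab = 128256
head_dim = hidden // heads                          # 64

embeddings = vocab * hidden                         # ~295.5M, shared with the LM head
attention = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim  # q/o + k/v projections
mlp = 3 * hidden * intermediate                     # gate, up, and down projections
norms = 2 * hidden                                  # two RMSNorms per layer
per_layer = attention + mlp + norms

total = embeddings + layers * per_layer + hidden    # + final RMSNorm
print(f"~{total / 1e9:.2f}B parameters")            # ~1.99B
```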

---

## Benefits of LLaMA Architecture

- **Efficiency:** High throughput, low memory use.
- **Scalability:** Works well across model sizes.
- **Flexibility:** Long-context support and task adaptability.
- **Research-Friendly:** Great for exploring attention, positional encoding, and training dynamics.

---

## Random-Llama-Small Specifics

This model uses random weights and:
- Has ~2B parameters across 22 layers.
- Uses a 2304 hidden size and 9216 FFN size.
- Uses a 128,256-token vocabulary and bfloat16 precision.
- Supports an extended context length of up to 131,072 tokens.

---

## Intended Use

- Research on transformer dynamics, optimization, or architectural changes.
- Baseline for pretraining or task-specific fine-tuning.
- Experimentation with scaling laws or custom architectures.

---

## Out-of-Scope Use

- **Not for direct production deployment.**
- **Not suitable for tasks needing coherence or accuracy without training.**

---

## Usage

### Requirements

- `transformers >= 4.45.0`  
- `torch >= 2.0`  
- GPU with ≥ 6GB VRAM (24GB+ for training)
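
A quick sanity check of the environment against these floors (a small sketch; adapt to your setup):

```python
import torch
import transformers

print("transformers:", transformers.__version__)  # expect >= 4.45.0
print("torch:", torch.__version__)                # expect >= 2.0
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
```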

---

### Inference Example

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="reflex-ai/random-llama-small")
print(pipe(messages))
```

> Note: Outputs will be random and incoherent due to the model’s untrained state.
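
For lower-level control over generation, the model can also be loaded directly. This is a sketch; the sampling settings are illustrative, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")
model = AutoModelForCausalLM.from_pretrained(
    "reflex-ai/random-llama-small",
    torch_dtype=torch.bfloat16,
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer("Who are you?", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```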

---

### Training Example

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LlamaForCausalLM,
    Trainer,
    TrainingArguments,
)

model = LlamaForCausalLM.from_pretrained("reflex-ai/random-llama-small")
tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")

# The data collator pads batches, so make sure a pad token is defined.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

training_args = TrainingArguments(
    output_dir="./random_llama_small_finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,  # matches the checkpoint's bfloat16 precision
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # a tokenized dataset (see the sketch below)
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

trainer.train()
```
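
The `your_dataset` placeholder above is assumed to be an already tokenized dataset. A minimal sketch of preparing one with the 🤗 `datasets` library, reusing the `tokenizer` from the example above (the `train.txt` path is a hypothetical local corpus):

```python
from datasets import load_dataset

# Load a plain-text corpus; "train.txt" is a placeholder path.
raw = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    # Truncate to a modest block size; adjust to your memory budget.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

your_dataset = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
```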

---

## Limitations

- **Random Initialization:** Needs significant training to be useful.
- **Resource Intensive:** High computational cost.
- **No Pretraining Data:** Users must provide their own.
- **Tokenizer Constraint:** May not suit all domains.

---

## Benefits and Potential

- **Customizability:** A blank slate for full control of objectives and data.
- **Research Insights:** Ideal for understanding early-stage LLM behavior.
- **Scalable Baseline:** Balances size and research feasibility.
- **Extended Context:** Useful for long-form tasks post-training.

---

## Model Configuration

```json
{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 2304,
  "num_hidden_layers": 22,
  "num_attention_heads": 36,
  "num_key_value_heads": 9,
  "intermediate_size": 9216,
  "vocab_size": 128256,
  "max_position_embeddings": 131072,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "torch_dtype": "bfloat16",
  "tie_word_embeddings": true
}
```
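
For reference, a randomly initialized model with the same shape can be rebuilt locally from this configuration (a sketch; the values mirror the JSON above):

```python
import torch
from transformers import AutoTokenizer, LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=2304,
    num_hidden_layers=22,
    num_attention_heads=36,
    num_key_value_heads=9,
    intermediate_size=9216,
    vocab_size=128256,
    max_position_embeddings=131072,
    rope_scaling={
        "rope_type": "llama3",
        "factor": 32.0,
        "high_freq_factor": 4.0,
        "low_freq_factor": 1.0,
        "original_max_position_embeddings": 8192,
    },
    tie_word_embeddings=True,
)

# Fresh random weights with the same shape as this checkpoint.
model = LlamaForCausalLM(config).to(torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```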

---

## Ethical Considerations

- **Untrained Safety:** The untrained model does not produce meaningful (and therefore not immediately harmful) outputs, but standard data and safety considerations apply once it is trained.
- **Environmental Impact:** Large-scale training consumes energy; optimize and use green compute.
- **Accessibility:** Resource requirements may limit use by smaller research teams.

---

## Contact

For questions or issues, please open an issue on the Hugging Face repository.

> *Model card created on April 20, 2025.*