---
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# Random-Llama-Small

## Model Overview

**Random-Llama-Small** is a randomly initialized transformer language model with approximately 2 billion parameters, built on the LLaMA architecture. It is intended for research, providing a starting point for pretraining or fine-tuning on custom datasets. The model uses the tokenizer from `HuggingFaceTB/SmolLM2-1.7B-Instruct` and is configured for causal language modeling. Because its weights are untrained, it produces incoherent output until it is trained, which makes it well suited to studying transformer training dynamics or developing custom language models.
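
Because the published checkpoint is itself randomly initialized, an equivalent model can be rebuilt locally from the configuration alone. Below is a minimal sketch (assuming the `reflex-ai/random-llama-small` repository id used in the inference example further down); constructing the model class directly from the config draws fresh random weights rather than loading the stored ones.

```python
# Minimal sketch: rebuild a randomly initialized copy of the model from its config.
from transformers import AutoConfig, AutoTokenizer, LlamaForCausalLM

config = AutoConfig.from_pretrained("reflex-ai/random-llama-small")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Instantiating from the config (instead of from_pretrained) draws fresh random
# weights -- the same untrained state as the published checkpoint.
model = LlamaForCausalLM(config)
print(f"~{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```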

---

## Key Details

- **Architecture:** LLaMA (causal language model)
- **Parameters:** ~2B (a rough breakdown is sketched after this list)
- **Hidden Size:** 2304
- **Layers:** 22
- **Attention Heads:** 36 (with 9 key-value heads for grouped-query attention)
- **Intermediate Size:** 9216
- **Vocabulary Size:** 128,256
- **Tokenizer:** Imported from `HuggingFaceTB/SmolLM2-1.7B-Instruct`
- **Precision:** bfloat16
- **Max Context Length:** 131,072 tokens (with RoPE scaling)
- **License:** MIT
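
As a sanity check on the parameter count above, the listed dimensions can be combined by hand. The sketch below ignores the small RMSNorm weights and counts the tied embedding matrix once:

```python
# Rough parameter count from the dimensions listed above (norm weights ignored).
hidden, layers, heads, kv_heads, ffn, vocab = 2304, 22, 36, 9, 9216, 128256
head_dim = hidden // heads                       # 64

embeddings = vocab * hidden                      # ~295M, shared with the output head
attn = 2 * hidden * hidden                       # q_proj + o_proj
attn += 2 * hidden * (kv_heads * head_dim)       # k_proj + v_proj (only 9 KV heads)
mlp = 3 * hidden * ffn                           # gate_proj, up_proj, down_proj

total = embeddings + layers * (attn + mlp)
print(f"~{total / 1e9:.2f}B parameters")         # ~1.99B
```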

---

## LLaMA Architecture

The LLaMA architecture, developed by Meta AI, is a family of efficient transformer-based models optimized for research. Random-Llama-Small follows this design and incorporates several key features:

### Core Components

- **Decoder-Only Transformer:** Predicts the next token in a sequence from the prior tokens, suitable for autoregressive tasks such as text generation.
- **Grouped-Query Attention (GQA):** 36 query heads share only 9 key-value heads, reducing memory and compute cost (see the sketch after this list).
- **Rotary Position Embeddings (RoPE):** Encodes positional information, with scaling that extends the context length to 131,072 tokens.
- **SwiGLU Activation:** The feed-forward network uses the SiLU (Swish)-gated SwiGLU variant for improved expressiveness.
- **RMSNorm:** Root Mean Square Layer Normalization replaces LayerNorm for stability and faster convergence.
- **Tied Embeddings:** Input and output embeddings share weights (`tie_word_embeddings=True`), reducing the parameter count by ~295M.
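
To make the GQA saving concrete, here is a short sketch of the per-token KV-cache size implied by the numbers above (assuming a bfloat16 cache, 2 bytes per value): with 9 KV heads instead of 36, the cache is 4x smaller than full multi-head attention would require.

```python
# Per-token KV-cache size implied by the architecture (bfloat16 = 2 bytes/value).
layers, heads, kv_heads, head_dim, bytes_per_val = 22, 36, 9, 64, 2

def kv_cache_bytes_per_token(n_kv_heads: int) -> int:
    # Two tensors (K and V) per layer, each n_kv_heads * head_dim values.
    return 2 * layers * n_kv_heads * head_dim * bytes_per_val

mha = kv_cache_bytes_per_token(heads)     # hypothetical full multi-head attention
gqa = kv_cache_bytes_per_token(kv_heads)  # grouped-query attention as configured
print(f"MHA: {mha / 1024:.0f} KiB/token, GQA: {gqa / 1024:.0f} KiB/token "
      f"({mha // gqa}x smaller)")
```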

---

## Benefits of the LLaMA Architecture

- **Efficiency:** High throughput and low memory use.
- **Scalability:** Works well across model sizes.
- **Flexibility:** Long-context support and task adaptability.
- **Research-Friendly:** Great for exploring attention, positional encoding, and training dynamics.

---

## Random-Llama-Small Specifics

This model ships with random weights and:

- Has ~2B parameters across 22 layers.
- Uses a hidden size of 2304 and an FFN size of 9216.
- Has a vocabulary of 128,256 tokens and uses bfloat16 precision.
- Supports an extended context length of 131,072 tokens.

---

## Intended Use

- Research on transformer dynamics, optimization, or architectural changes.
- Baseline for pretraining or task-specific fine-tuning.
- Experimentation with scaling laws or custom architectures.

---

## Out-of-Scope Use

- **Not for direct production deployment.**
- **Not suitable for tasks that require coherence or accuracy until it has been trained.**

---

## Usage

### Requirements

- `transformers >= 4.45.0`
- `torch >= 2.0`
- GPU with ≥ 6 GB VRAM for inference (24 GB+ recommended for training); a quick environment check is sketched below
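
The snippet below is one quick way to confirm the requirements above are met (an illustrative sketch, not part of the original setup instructions):

```python
# Quick environment check for the requirements listed above (illustrative only).
import torch
import transformers

print("transformers:", transformers.__version__)  # want >= 4.45.0
print("torch:", torch.__version__)                # want >= 2.0

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
else:
    print("No CUDA GPU detected; inference will fall back to CPU.")
```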

---

### Inference Example

```python
# Use a pipeline as a high-level helper.
from transformers import pipeline

# Chat-style input; the pipeline applies the tokenizer's chat template.
messages = [
    {"role": "user", "content": "Who are you?"},
]

pipe = pipeline("text-generation", model="reflex-ai/random-llama-small")
print(pipe(messages))
```

> Note: Outputs will be random and incoherent due to the model’s untrained state.
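
For lower-level control over dtype, device placement, and sampling, the checkpoint can also be loaded directly rather than through the pipeline. A minimal sketch, assuming the same `reflex-ai/random-llama-small` repository:

```python
# Minimal sketch: load the checkpoint directly and generate with explicit settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "reflex-ai/random-llama-small"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

inputs = tokenizer("Who are you?", return_tensors="pt").to(device)
# Expect gibberish: the weights are random until the model is trained.
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```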

---

### Training Example

```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, LlamaForCausalLM, AutoTokenizer

# Replace "your_username/random-llama-small" with the repository you train from.
model = LlamaForCausalLM.from_pretrained("your_username/random-llama-small")
tokenizer = AutoTokenizer.from_pretrained("your_username/random-llama-small")

# The data collator pads batches, so make sure a pad token is defined.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

training_args = TrainingArguments(
    output_dir="./random_llama_small_finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,  # the weights are bfloat16, so prefer bf16 over fp16
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # a tokenized dataset you provide (see sketch below)
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

trainer.train()
```
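
The `your_dataset` placeholder above must be a tokenized dataset. One possible way to build a tiny one with the `datasets` library is sketched below (illustrative only; `datasets` is an additional dependency, the texts are dummy data, and `tokenizer` comes from the training example above):

```python
# Illustrative sketch: build a tiny tokenized dataset for the Trainer above.
from datasets import Dataset

raw = Dataset.from_dict(
    {"text": ["Hello world.", "Random-Llama-Small starts from random weights."]}
)

def tokenize(batch):
    # Truncate to a manageable length; the collator handles padding and labels.
    return tokenizer(batch["text"], truncation=True, max_length=512)

your_dataset = raw.map(tokenize, batched=True, remove_columns=["text"])
```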

---

## Limitations

- **Random Initialization:** Needs significant training to be useful.
- **Resource Intensive:** High computational cost.
- **No Pretraining Data:** Users must provide their own.
- **Tokenizer Constraint:** May not suit all domains.

---

## Benefits and Potential

- **Customizability:** A blank slate for full control of objectives and data.
- **Research Insights:** Ideal for understanding early-stage LLM behavior.
- **Scalable Baseline:** Balances size and research feasibility.
- **Extended Context:** Useful for long-form tasks post-training.

---

## Model Configuration

```json
{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 2304,
  "num_hidden_layers": 22,
  "num_attention_heads": 36,
  "num_key_value_heads": 9,
  "intermediate_size": 9216,
  "vocab_size": 128256,
  "max_position_embeddings": 131072,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "torch_dtype": "bfloat16",
  "tie_word_embeddings": true
}
```
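
One practical implication of `max_position_embeddings: 131072` is the KV-cache footprint at full context. A rough sketch using the per-token figure from the GQA example above:

```python
# KV-cache footprint of one full-length (131,072-token) sequence in bfloat16.
layers, kv_heads, head_dim, bytes_per_val = 22, 9, 64, 2
max_len = 131_072

per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V per layer
total = per_token * max_len
print(f"{per_token} bytes/token -> {total / 1024**3:.1f} GiB per full-length sequence")
```

In other words, a single full-context sequence occupies roughly 6 GiB of cache on top of the ~4 GB of bfloat16 weights, so long-context use needs substantially more than the minimum VRAM quoted above.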

---

## Ethical Considerations

- **Untrained Safety:** The untrained model poses no immediate risk of harmful outputs, but ethical considerations apply to the data and objectives used once training begins.
- **Environmental Impact:** Large-scale training consumes significant energy; optimize training runs and prefer low-carbon compute where possible.
- **Accessibility:** Resource requirements may limit use by smaller research teams.

---

## Contact

For questions or issues, please open an issue on the Hugging Face repository.

> *Model card created on April 20, 2025.*
|