---
license: mit
pipeline_tag: text-generation
library_name: transformers
---
# Random-Llama-Small
## Model Overview
**Random-Llama-Small** is a randomly initialized transformer-based language model with approximately 2 billion parameters, built using the LLaMA architecture. It is designed for research purposes, providing a starting point for pretraining or fine-tuning on custom datasets. The model uses the tokenizer from `HuggingFaceTB/SmolLM2-1.7B-Instruct` and is configured for causal language modeling. As a randomly initialized model, it produces incoherent outputs until trained, making it ideal for researchers studying transformer training dynamics or developing custom language models.
---
## Key Details
- **Architecture:** LLaMA (Causal Language Model)
- **Parameters:** ~2B
- **Hidden Size:** 2304
- **Layers:** 22
- **Attention Heads:** 36 (with 9 key-value heads for grouped-query attention)
- **Intermediate Size:** 9216
- **Vocabulary Size:** 128,256
- **Tokenizer:** Imported from `HuggingFaceTB/SmolLM2-1.7B-Instruct`
- **Precision:** bfloat16
- **Max Context Length:** 131,072 tokens (with RoPE scaling)
- **License:** MIT
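To double-check these values against the files on the Hub, you can inspect the published config and tokenizer. A minimal sketch (assuming the `reflex-ai/random-llama-small` repo id used in the inference example below and network access to the Hub):

```python
# Sketch: load the config and tokenizer and print the key details listed above.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("reflex-ai/random-llama-small")
tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")

print(config.hidden_size)               # 2304
print(config.num_hidden_layers)         # 22
print(config.num_attention_heads)       # 36
print(config.num_key_value_heads)       # 9
print(config.intermediate_size)         # 9216
print(config.vocab_size)                # 128256
print(config.max_position_embeddings)   # 131072
print(len(tokenizer))                   # tokenizer vocabulary size (may differ from the padded model vocab_size)
```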
---
## LLaMA Architecture
The LLaMA architecture, developed by Meta AI, underlies a family of efficient transformer-based models widely used in research. Random-Llama-Small follows this design and incorporates several of its key features:
### Core Components
- **Decoder-Only Transformer:** Predicts the next token in a sequence based on prior tokens, suitable for autoregressive tasks like text generation.
- **Grouped-Query Attention (GQA):** 36 attention heads with only 9 key-value heads, improving efficiency and reducing memory/compute cost.
- **Rotary Position Embeddings (RoPE):** Embeds positional information with scaling, enabling a context length of up to 131,072 tokens.
- **SwiGLU Activation:** The FFN uses SiLU (Swish)-gated linear units (SwiGLU) for improved expressiveness.
- **RMSNorm:** Root Mean Square Layer Normalization replaces LayerNorm for stability and faster convergence.
- **Tied Embeddings:** Input and output embeddings share weights (`tie_word_embeddings=True`), reducing parameter count by ~295M.
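As a concrete illustration of the GQA layout, the attention projection shapes can be derived from the configuration values above. A small sketch (arithmetic only, no model weights are loaded):

```python
# Sketch: derive the attention projection shapes implied by the config values above.
hidden_size = 2304
num_attention_heads = 36
num_key_value_heads = 9

head_dim = hidden_size // num_attention_heads                       # 64
q_proj_out = num_attention_heads * head_dim                         # 2304
kv_proj_out = num_key_value_heads * head_dim                        # 576
queries_per_kv_head = num_attention_heads // num_key_value_heads    # 4

print(f"head_dim={head_dim}, q_proj: {hidden_size}x{q_proj_out}, "
      f"k/v_proj: {hidden_size}x{kv_proj_out}, "
      f"{queries_per_kv_head} query heads share each key-value head")
```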
---
## Benefits of LLaMA Architecture
- **Efficiency:** High throughput, low memory use.
- **Scalability:** Works well across model sizes.
- **Flexibility:** Long-context support and task adaptability.
- **Research-Friendly:** Great for exploring attention, positional encoding, and training dynamics.
---
## Random-Llama-Small Specifics
This model uses random weights and:
- Has ~2B parameters across 22 layers.
- Uses a 2304 hidden size and 9216 FFN size.
- Has a 128,256-token vocabulary and uses bfloat16 precision.
- Supports an extended context length of up to 131,072 tokens.
---
## Intended Use
- Research on transformer dynamics, optimization, or architectural changes.
- Baseline for pretraining or task-specific fine-tuning.
- Experimentation with scaling laws or custom architectures.
---
## Out-of-Scope Use
- **Not for direct production deployment.**
- **Not suitable for tasks needing coherence or accuracy without training.**
---
## Usage
### Requirements
- `transformers >= 4.45.0`
- `torch >= 2.0`
- GPU with ≥ 6GB VRAM (24GB+ for training)
---
### Inference Example
```python
# Use a pipeline as a high-level helper
from transformers import pipeline
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="reflex-ai/random-llama-small")
print(pipe(messages))
```
> Note: Outputs will be random and incoherent due to the model’s untrained state.
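If you prefer working below the pipeline level, here is a minimal sketch of loading the checkpoint directly in bfloat16 (it assumes a GPU and the `accelerate` package for `device_map="auto"`; the prompt is arbitrary):

```python
# Sketch: load the untrained model and tokenizer directly in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "reflex-ai/random-llama-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the checkpoint precision
    device_map="auto",
)

inputs = tokenizer("Who are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # random tokens until the model is trained
```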
---
### Training Example
```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LlamaForCausalLM,
    Trainer,
    TrainingArguments,
)

model = LlamaForCausalLM.from_pretrained("reflex-ai/random-llama-small")
tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")

training_args = TrainingArguments(
    output_dir="./random_llama_small_finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,  # the checkpoint is stored in bfloat16, so train in bf16 rather than fp16
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # your tokenized dataset (see the sketch below)
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```
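`your_dataset` above is a placeholder for your own tokenized corpus. As one possible way to build it, the sketch below tokenizes a plain-text file with the `datasets` library; the file path and the 2048-token cutoff are illustrative assumptions:

```python
# Sketch: one way to build `your_dataset` for the Trainer example above.
# "path/to/your_corpus.txt" is a placeholder; the max_length cutoff is arbitrary.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")
raw = load_dataset("text", data_files={"train": "path/to/your_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

your_dataset = raw["train"].map(
    tokenize,
    batched=True,
    remove_columns=raw["train"].column_names,  # keep only input_ids / attention_mask
)
```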
---
## Limitations
- **Random Initialization:** Needs significant training to be useful.
- **Resource Intensive:** High computational cost.
- **No Pretraining Data:** Users must provide their own.
- **Tokenizer Constraint:** The SmolLM2 tokenizer may not suit all domains or languages.
---
## Benefits and Potential
- **Customizability:** A blank slate for full control of objectives and data.
- **Research Insights:** Ideal for understanding early-stage LLM behavior.
- **Scalable Baseline:** Balances size and research feasibility.
- **Extended Context:** Useful for long-form tasks post-training.
---
## Model Configuration
```json
{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 2304,
  "num_hidden_layers": 22,
  "num_attention_heads": 36,
  "num_key_value_heads": 9,
  "intermediate_size": 9216,
  "vocab_size": 128256,
  "max_position_embeddings": 131072,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "torch_dtype": "bfloat16",
  "tie_word_embeddings": true
}
```
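For reference, the ~2B parameter count (and the ~295M saved by tying the input and output embeddings) can be reproduced from these values. A back-of-the-envelope sketch:

```python
# Sketch: approximate parameter count implied by the config above.
hidden, layers, heads, kv_heads = 2304, 22, 36, 9
ffn, vocab = 9216, 128256
head_dim = hidden // heads  # 64

embed = vocab * hidden                                                    # ~295.5M (shared with lm_head)
attn = hidden * heads * head_dim * 2 + hidden * kv_heads * head_dim * 2   # q/o plus k/v projections
mlp = 3 * hidden * ffn                                                    # gate, up, down projections
norms = 2 * hidden                                                        # two RMSNorms per layer
per_layer = attn + mlp + norms

total = embed + layers * per_layer + hidden                               # + final RMSNorm
print(f"{total / 1e9:.2f}B parameters")                                   # ~1.99B with tied embeddings
print(f"{embed / 1e6:.0f}M saved by tying input/output embeddings")
```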
---
## Ethical Considerations
- **Untrained Safety:** The randomly initialized model poses no immediate risk of harmful outputs, but safety and data ethics must be considered once training begins.
- **Environmental Impact:** Large-scale training consumes energy; optimize and use green compute.
- **Accessibility:** Resource requirements may limit use by smaller research teams.
---
## Contact
For questions or issues, please open an issue on the Hugging Face repository.
> *Model card created on April 20, 2025.*