---
license: mit
pipeline_tag: text-generation
library_name: transformers
---
# Random-Llama-Small
## Model Overview
**Random-Llama-Small** is a randomly initialized transformer-based language model with approximately 2 billion parameters, built using the LLaMA architecture. It is designed for research purposes, providing a starting point for pretraining or fine-tuning on custom datasets. The model uses the tokenizer from `HuggingFaceTB/SmolLM2-1.7B-Instruct` and is configured for causal language modeling. As a randomly initialized model, it produces incoherent outputs until trained, making it ideal for researchers studying transformer training dynamics or developing custom language models.
---
## Key Details
- **Architecture:** LLaMA (Causal Language Model)
- **Parameters:** ~2B
- **Hidden Size:** 2304
- **Layers:** 22
- **Attention Heads:** 36 (with 9 key-value heads for grouped-query attention)
- **Intermediate Size:** 9216
- **Vocabulary Size:** 128,256
- **Tokenizer:** Imported from `HuggingFaceTB/SmolLM2-1.7B-Instruct`
- **Precision:** bfloat16
- **Max Context Length:** 131,072 tokens (with RoPE scaling)
- **License:** MIT
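To double-check these values against the files on the Hub, you can inspect the published config and tokenizer. A minimal sketch (assuming the `reflex-ai/random-llama-small` repo id used in the inference example below and network access to the Hub):

```python
# Sketch: load the config and tokenizer and print the key details listed above.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("reflex-ai/random-llama-small")
tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")

print(config.hidden_size)               # 2304
print(config.num_hidden_layers)         # 22
print(config.num_attention_heads)       # 36
print(config.num_key_value_heads)       # 9
print(config.intermediate_size)         # 9216
print(config.vocab_size)                # 128256
print(config.max_position_embeddings)   # 131072
print(len(tokenizer))                   # tokenizer vocabulary size (may differ from the padded model vocab_size)
```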
---
## LLaMA Architecture
The LLaMA architecture, developed by Meta AI, underlies a family of efficient transformer-based models widely used in research. Random-Llama-Small follows this design and incorporates several of its key features:
### Core Components
- **Decoder-Only Transformer:** Predicts the next token in a sequence based on prior tokens, suitable for autoregressive tasks like text generation.
- **Grouped-Query Attention (GQA):** 36 attention heads with only 9 key-value heads, improving efficiency and reducing memory/compute cost.
- **Rotary Position Embeddings (RoPE):** Embeds positional information with scaling, enabling a context length of up to 131,072 tokens.
- **SwiGLU Activation:** The FFN uses SiLU (Swish)-gated linear units (SwiGLU) for improved expressiveness.
- **RMSNorm:** Root Mean Square Layer Normalization replaces LayerNorm for stability and faster convergence.
- **Tied Embeddings:** Input and output embeddings share weights (`tie_word_embeddings=True`), reducing parameter count by ~295M.
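As a concrete illustration of the GQA layout, the attention projection shapes can be derived from the configuration values above. A small sketch (arithmetic only, no model weights are loaded):

```python
# Sketch: derive the attention projection shapes implied by the config values above.
hidden_size = 2304
num_attention_heads = 36
num_key_value_heads = 9

head_dim = hidden_size // num_attention_heads                       # 64
q_proj_out = num_attention_heads * head_dim                         # 2304
kv_proj_out = num_key_value_heads * head_dim                        # 576
queries_per_kv_head = num_attention_heads // num_key_value_heads    # 4

print(f"head_dim={head_dim}, q_proj: {hidden_size}x{q_proj_out}, "
      f"k/v_proj: {hidden_size}x{kv_proj_out}, "
      f"{queries_per_kv_head} query heads share each key-value head")
```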
---
## Benefits of LLaMA Architecture
- **Efficiency:** High throughput, low memory use.
- **Scalability:** Works well across model sizes.
- **Flexibility:** Long-context support and task adaptability.
- **Research-Friendly:** Great for exploring attention, positional encoding, and training dynamics.
---
## Random-Llama-Small Specifics
This model uses random weights and:
- Has ~2B parameters across 22 layers.
- Uses a 2304 hidden size and 9216 FFN size.
- Has a 128,256-token vocabulary and uses bfloat16 precision.
- Supports an extended context length of up to 131,072 tokens.
---
## Intended Use
- Research on transformer dynamics, optimization, or architectural changes.
- Baseline for pretraining or task-specific fine-tuning.
- Experimentation with scaling laws or custom architectures.
---
## Out-of-Scope Use
- **Not for direct production deployment.**
- **Not suitable for tasks needing coherence or accuracy without training.**
---
## Usage
### Requirements
- `transformers >= 4.45.0`
- `torch >= 2.0`
- GPU with ≥ 6GB VRAM (24GB+ for training)
---
### Inference Example
```python
# Use a pipeline as a high-level helper
from transformers import pipeline
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="reflex-ai/random-llama-small")
print(pipe(messages))
```
> Note: Outputs will be random and incoherent due to the model’s untrained state.
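If you prefer working below the pipeline level, here is a minimal sketch of loading the checkpoint directly in bfloat16 (it assumes a GPU and the `accelerate` package for `device_map="auto"`; the prompt is arbitrary):

```python
# Sketch: load the untrained model and tokenizer directly in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "reflex-ai/random-llama-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the checkpoint precision
    device_map="auto",
)

inputs = tokenizer("Who are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # random tokens until the model is trained
```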
---
### Training Example
```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LlamaForCausalLM,
    Trainer,
    TrainingArguments,
)

model = LlamaForCausalLM.from_pretrained("reflex-ai/random-llama-small")
tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")

training_args = TrainingArguments(
    output_dir="./random_llama_small_finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,  # the checkpoint is stored in bfloat16, so train in bf16 rather than fp16
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # your tokenized dataset (see the sketch below)
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```
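`your_dataset` above is a placeholder for your own tokenized corpus. As one possible way to build it, the sketch below tokenizes a plain-text file with the `datasets` library; the file path and the 2048-token cutoff are illustrative assumptions:

```python
# Sketch: one way to build `your_dataset` for the Trainer example above.
# "path/to/your_corpus.txt" is a placeholder; the max_length cutoff is arbitrary.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")
raw = load_dataset("text", data_files={"train": "path/to/your_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

your_dataset = raw["train"].map(
    tokenize,
    batched=True,
    remove_columns=raw["train"].column_names,  # keep only input_ids / attention_mask
)
```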
---
## Limitations
- **Random Initialization:** Needs significant training to be useful.
- **Resource Intensive:** High computational cost.
- **No Pretraining Data:** Users must provide their own.
- **Tokenizer Constraint:** The SmolLM2 tokenizer may not suit all domains or languages.
---
## Benefits and Potential
- **Customizability:** A blank slate for full control of objectives and data.
- **Research Insights:** Ideal for understanding early-stage LLM behavior.
- **Scalable Baseline:** Balances size and research feasibility.
- **Extended Context:** Useful for long-form tasks post-training.
---
## Model Configuration
```json
{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 2304,
  "num_hidden_layers": 22,
  "num_attention_heads": 36,
  "num_key_value_heads": 9,
  "intermediate_size": 9216,
  "vocab_size": 128256,
  "max_position_embeddings": 131072,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "torch_dtype": "bfloat16",
  "tie_word_embeddings": true
}
```
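For reference, the ~2B parameter count (and the ~295M saved by tying the input and output embeddings) can be reproduced from these values. A back-of-the-envelope sketch:

```python
# Sketch: approximate parameter count implied by the config above.
hidden, layers, heads, kv_heads = 2304, 22, 36, 9
ffn, vocab = 9216, 128256
head_dim = hidden // heads  # 64

embed = vocab * hidden                                                    # ~295.5M (shared with lm_head)
attn = hidden * heads * head_dim * 2 + hidden * kv_heads * head_dim * 2   # q/o plus k/v projections
mlp = 3 * hidden * ffn                                                    # gate, up, down projections
norms = 2 * hidden                                                        # two RMSNorms per layer
per_layer = attn + mlp + norms

total = embed + layers * per_layer + hidden                               # + final RMSNorm
print(f"{total / 1e9:.2f}B parameters")                                   # ~1.99B with tied embeddings
print(f"{embed / 1e6:.0f}M saved by tying input/output embeddings")
```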
---
## Ethical Considerations
- **Untrained Safety:** The randomly initialized model poses no immediate risk of harmful outputs, but safety and data ethics must be considered once training begins.
- **Environmental Impact:** Large-scale training consumes energy; optimize and use green compute.
- **Accessibility:** Resource requirements may limit use by smaller research teams.
---
## Contact
For questions or issues, please open an issue on the Hugging Face repository.
> *Model card created on April 20, 2025.*