TinyStories Llama Model

Model Description

This is a small Llama-architecture language model trained on the TinyStories dataset. The model is designed to generate simple, coherent children's stories using a vocabulary and concepts that a typical 3-4 year old would understand.

Model Architecture: Llama 2
Training Framework: PyTorch
Implementation: Based on llama2.c

Model Details

Architecture Hyperparameters

  • Dimension: 288
  • Number of Layers: 6
  • Number of Attention Heads: 6
  • Number of KV Heads: 6
  • Vocabulary Size: 32,000 (Llama 2 tokenizer)
  • Maximum Sequence Length: 256 tokens
  • Dropout: 0.0
  • Hidden Dimension Multiple: 32

Total Parameters: ~15M
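
For reference, these hyperparameters map onto a llama2.c-style configuration roughly as follows. This is an illustrative sketch, not the exact training config: field names follow llama2.c's ModelArgs, and defaults such as norm_eps are omitted.

from dataclasses import dataclass

@dataclass
class ModelArgs:
    # Values copied from the list above; field names follow llama2.c's model.py.
    dim: int = 288            # embedding / residual stream width
    n_layers: int = 6
    n_heads: int = 6
    n_kv_heads: int = 6       # equal to n_heads, so no grouped-query sharing
    vocab_size: int = 32000   # Llama 2 SentencePiece tokenizer
    multiple_of: int = 32     # FFN hidden size is rounded up to a multiple of this
    max_seq_len: int = 256
    dropout: float = 0.0

args = ModelArgs()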

Training Hyperparameters

  • Batch Size: 128 (micro-batch)
  • Gradient Accumulation Steps: 4
  • Effective Batch Size: 512
  • Learning Rate: 5e-4 (max)
  • Learning Rate Schedule: Cosine decay with warmup
  • Warmup Iterations: 1,000
  • Total Training Iterations: 100,000
  • Weight Decay: 0.1
  • Beta1: 0.9
  • Beta2: 0.95
  • Gradient Clipping: 1.0
  • Optimizer: AdamW
  • Precision: bfloat16 (with mixed precision training)

Tokens per Iteration: 131,072 (4 grad accum × 1 process × 128 micro-batch × 256 seq len)
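
A minimal sketch of this setup in PyTorch, assuming a llama2.c-style training loop. The learning-rate floor (min_lr) is not stated in this card and is assumed to be 0 here, and a small Linear module stands in for the 15M-parameter Transformer so the snippet runs on its own.

import math
import torch

# Cosine decay with linear warmup, matching the schedule listed above.
def get_lr(it, max_lr=5e-4, min_lr=0.0, warmup_iters=1000, max_iters=100000):
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

# AdamW with the betas and weight decay listed above (stand-in model for illustration).
model = torch.nn.Linear(288, 32000)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

# One optimizer update = 4 accumulated micro-batches, then gradient clipping at 1.0.
grad_accum_steps, iter_num = 4, 0
for _ in range(grad_accum_steps):
    x = torch.randn(8, 288)                      # dummy micro-batch
    loss = model(x).mean()
    (loss / grad_accum_steps).backward()         # average gradients across micro-batches
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
for group in optimizer.param_groups:
    group["lr"] = get_lr(iter_num)               # set the scheduled learning rate
optimizer.step()
optimizer.zero_grad(set_to_none=True)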

Intended Use

This model is intended for:

  • Generating simple children's stories
  • Educational demonstrations of small-scale language model training
  • Research into emergent capabilities in small language models
  • Experimentation with efficient inference (e.g., pure C implementation)

Limitations

  • Domain-Specific: The model is trained exclusively on simple stories and will not perform well on general text generation tasks
  • Vocabulary: Limited to concepts and language appropriate for very young children
  • Context Length: Maximum sequence length of 256 tokens limits story length
  • No Instruction Following: This is a base model without instruction tuning

Training Data

The model was trained on the TinyStories dataset, which consists of short stories generated to contain only words that a typical 3-4 year old would understand. The dataset was created to study the capabilities of small language models.

Dataset Size: ~2.1M stories
Vocabulary: Words understandable by 3-4 year olds
Content: Simple narratives, common objects, basic emotions and actions
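
The corpus is available on the Hugging Face Hub. A minimal loading sketch, assuming the public roneneldan/TinyStories dataset (one story per record in a "text" field) matches the export used for training:

from datasets import load_dataset

# Assumed to match the training corpus; the card does not name a specific export.
stories = load_dataset("roneneldan/TinyStories", split="train")
print(len(stories))              # on the order of 2.1M stories
print(stories[0]["text"][:200])  # preview the first story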

Example Outputs

Prompt: "Once upon a time, there was a little girl named Lily."

Generation (temperature=0.8, top_p=0.9):

She loved to play outside in the park. One day, she saw a big, red ball. 
She wanted to play with it, but it was too high. Lily's mom said, "Let's 
go get it together!" They worked together and got the ball down. Lily was 
so happy! She played with the ball all day long.
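
To reproduce this kind of sample, a minimal generation sketch is shown below, assuming the safetensors checkpoint loads through transformers' standard Llama classes (if you are working with a llama2.c .bin export instead, use run.c from that repository).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sdobson/tinystories-llama-15m"   # this repository
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "Once upon a time, there was a little girl named Lily."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=True,
                            temperature=0.8, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))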

Citation

If you use this model or the llama2.c implementation, please cite:

@misc{llama2c,
  author = {Andrej Karpathy},
  title = {llama2.c: Inference Llama 2 in one file of pure C},
  year = {2023},
  publisher = {GitHub},
  url = {https://github.com/karpathy/llama2.c}
}

@article{eldan2023tinystories,
  title={TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
  author={Eldan, Ronen and Li, Yuanzhi},
  journal={arXiv preprint arXiv:2305.07759},
  year={2023}
}

License

MIT License - See the LICENSE file for details.

Acknowledgments

  • Model architecture and training code adapted from llama2.c by Andrej Karpathy
  • Trained on the TinyStories dataset by Ronen Eldan and Yuanzhi Li
  • Based on the Llama 2 architecture by Meta AI