# nanochat

nanochat is a 561M-parameter transformer language model trained for conversational AI. It demonstrates that a capable chat model can be trained on a modest compute budget (roughly $100 of 8x H100 GPU time).
Read about the process at https://samdobson.uk/posts/training-a-chatgpt-clone-for-cheap/
Chat with the model at https://huggingface.co/spaces/sdobson/nanochat
## Model Description
- Developed by: Andrej Karpathy
- Trained by: Sam Dobson
- Model type: Transformer-based causal language model
- Language(s): English
- License: MIT
- Parameters: 560,988,160 (~561M)
## Architecture
- Layers: 20
- Hidden size: 1280 channels
- Attention heads: 10
- Head dimension: 128
- Vocabulary size: 65,536 tokens
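These figures are consistent with the reported parameter count. As a rough sanity check (a sketch only; it assumes untied input/output embeddings, no bias terms, and a 4x MLP expansion, i.e. roughly 12·d² weights per transformer layer):

```python
# Back-of-envelope parameter count under the assumptions stated above
# (untied embeddings, no biases, 4x MLP expansion => ~12*d^2 weights per layer).
d_model = 1280      # hidden size
n_layers = 20
vocab = 65536

per_layer = 12 * d_model ** 2        # 4*d^2 attention + 8*d^2 MLP
embeddings = 2 * vocab * d_model     # token embedding + untied LM head
total = n_layers * per_layer + embeddings
print(f"{total:,}")                  # 560,988,160 -- matches the total reported above
```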
## Training Details

### Training Data
nanochat was trained in multiple stages:
- Pretraining: a subset of the FineWeb-EDU 100B-token sample (~11.2B tokens processed)
- Midtraining: SmolTalk conversations, MMLU multiple choice questions, GSM8K math problems
- Supervised Fine-tuning (SFT): Conversational adaptation data
### Training Procedure

#### Tokenization
- Custom Rust-based tokenizer
- Vocabulary: 65,536 tokens
- Compression ratio: 4.8 characters per token
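A minimal sketch of how the compression ratio can be measured, assuming the pickled tokenizer (the `tokenizer.pkl` file referenced in the inference guide below) exposes an `encode(text)` method returning token ids, and that its defining module is importable, e.g. with the nanochat repo on `PYTHONPATH`:

```python
import pickle

# Sketch only: the exact tokenizer interface lives in the nanochat repo;
# unpickling may require that package to be importable.
with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

text = open("sample.txt").read()   # any English text sample
ids = tokenizer.encode(text)
print(len(text) / len(ids))        # ~4.8 characters per token, as reported above
```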
#### Training Infrastructure
- Hardware: 8x H100 GPUs (Lambda GPU Cloud)
- Training time: ~3 hours for pretraining stage
- Estimated compute: ~4e19 FLOPs
- Total cost: ~$100
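The compute estimate lines up with the common 6·N·D rule of thumb (about 6 FLOPs per parameter per training token):

```python
# Rough check of the quoted compute budget using the 6*N*D approximation.
params = 560_988_160   # model parameters
tokens = 11.2e9        # training tokens processed
flops = 6 * params * tokens
print(f"{flops:.2e}")  # ~3.8e+19, consistent with the ~4e19 figure above
```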
#### Training Stages
The model was trained in three stages:
- Pretraining on web text (FineWeb-EDU)
- Midtraining on domain-specific datasets (reasoning, conversation, maths)
- Supervised fine-tuning for chat optimisation
## Performance

### Benchmark Results
| Benchmark | Score | Description |
|---|---|---|
| MMLU | 23.99% | Multitask language understanding |
| GSM8K | 4.47% | Grade school math problems |
| HumanEval | 6.71% | Python code generation |
| ARC-Easy | 24.79% | Science questions (easy) |
| ARC-Challenge | 24.32% | Science questions (hard) |
| ChatCORE | 1.73% | Conversational reasoning |
## Intended Use

### Direct Use
nanochat is designed for:
- Conversational AI applications
- Research on efficient language model training
- Educational purposes for understanding LLM training pipelines
- Low-resource deployment scenarios
### Downstream Use
The model can be fine-tuned for specific conversational tasks or used as a base model for further domain adaptation.
### Out-of-Scope Use
- Production-grade conversational AI (the model is relatively small and has limited capabilities)
- Tasks requiring specialised knowledge or high accuracy
- Critical applications where errors could cause harm
## Limitations and Bias
- Small scale: at 561M parameters, the model is significantly less capable than larger (1B+ parameter) models
- Limited training: Trained on only 11.2B tokens, which is modest by modern standards
- Performance: Benchmark scores indicate limited reasoning and mathematical capabilities
- Bias: Inherits biases from training data (FineWeb-EDU, SmolTalk, etc.)
- Language: English-only
## Inference Guide
Simon Willison created a script that runs the model on CPU on macOS:
```bash
cd /tmp
git clone https://huggingface.co/sdobson/nanochat
uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
  --model-dir /tmp/nanochat \
  --prompt "Tell me about dogs."
```
Otherwise you can:

- Download all files
- Put `tokenizer.pkl` and `token_bytes.pt` in `~/.cache/nanochat/tokenizer`
- Put `model_000650.pt` and `meta_000650.json` in `~/.cache/nanochat/chatsft_checkpoints/d20`
- Clone https://github.com/karpathy/nanochat
- Run `uv sync` followed by `uv run python -m scripts.chat_web`
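To sanity-check the downloaded files before launching the chat server, here is a minimal sketch (the checkpoint's internal key layout is not documented in this card, so it only prints what it finds):

```python
import json
import os
import torch

# Paths follow the cache layout described in the steps above.
ckpt_dir = os.path.expanduser("~/.cache/nanochat/chatsft_checkpoints/d20")

meta = json.load(open(os.path.join(ckpt_dir, "meta_000650.json")))
state = torch.load(os.path.join(ckpt_dir, "model_000650.pt"),
                   map_location="cpu", weights_only=False)

print(meta)
print(list(state.keys())[:10] if isinstance(state, dict) else type(state))
```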
## Citation
Repository: github.com/karpathy/nanochat
```bibtex
@software{nanochat2025,
  author = {Karpathy, Andrej},
  title  = {nanochat: A 561M parameter conversational language model},
  year   = {2025},
  url    = {https://github.com/karpathy/nanochat}
}
```
## Model Card Author
Sam Dobson