# Nanochat Tokenizer
This is the tokenizer from Andrej Karpathy's educational project nanochat; training it is the first step of the `speedrun.sh` script.
## Training
First, we need to download the first ~2B characters of the pretraining dataset using the dataset script in nanochat:
```bash
export NANOCHAT_BASE_DIR=".cache/nanochat"
mkdir -p $NANOCHAT_BASE_DIR
python -m nanochat.dataset -n 8
```
Then, we can train the tokenizer (vocab size 65,536) on ~2B characters of data:
```bash
python -m scripts.tok_train --max_chars=2000000000
```
And finally, evaluate:
```bash
python -m scripts.tok_eval
```
## Tokenizer training
timestamp: 2025-10-14 10:29:05
- max_chars: 2,000,000,000
- doc_cap: 10,000
- vocab_size: 65,536
- train_time: 52.9085 s
- num_special_tokens: 9
- token_bytes_min: 1
- token_bytes_max: 32
- token_bytes_mean: 6.9197
- token_bytes_std: 2.8748
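The `token_bytes_*` statistics above summarize the byte lengths of the learned vocabulary entries. Below is a minimal sketch of how such statistics can be computed from a token-id-to-bytes mapping; the `token_byte_stats` helper and the toy vocabulary are illustrative only (the real 65,536-entry vocabulary is produced by `scripts.tok_train`).

```python
import statistics

def token_byte_stats(vocab: dict[int, bytes]) -> dict[str, float]:
    """Summarize the byte lengths of vocabulary entries."""
    lengths = [len(token) for token in vocab.values()]
    return {
        "token_bytes_min": min(lengths),
        "token_bytes_max": max(lengths),
        "token_bytes_mean": statistics.mean(lengths),
        # Sample standard deviation; the report may use the population variant.
        "token_bytes_std": statistics.stdev(lengths),
    }

# Toy vocabulary for illustration only.
toy_vocab = {0: b"a", 1: b"the ", 2: b" tokenizer", 3: b"\n"}
print(token_byte_stats(toy_vocab))
```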
## Tokenizer evaluation
timestamp: 2025-10-14 10:29:10

In the tables below, the ratio is bytes per token (higher means better compression), and the relative difference compares token counts against the baseline tokenizer (positive means ours needs fewer tokens).

### Comparison with GPT-2
Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
---|---|---|---|---|---|---|
news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
korean | 893 | 745 | 1.20 | 712 | 1.25 | +4.4% |
code | 1259 | 576 | 2.19 | 492 | 2.56 | +14.6% |
math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
science | 1112 | 260 | 4.28 | 228 | 4.88 | +12.3% |
fwe-train | 4208518 | 900364 | 4.67 | 856883 | 4.91 | +4.8% |
fwe-val | 4991242 | 1075364 | 4.64 | 1027241 | 4.86 | +4.5% |
### Comparison with GPT-4
Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
---|---|---|---|---|---|---|
news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
korean | 893 | 364 | 2.45 | 712 | 1.25 | -95.6% |
code | 1259 | 309 | 4.07 | 492 | 2.56 | -59.2% |
math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
science | 1112 | 249 | 4.47 | 228 | 4.88 | +8.4% |
fwe-train | 4208518 | 874799 | 4.81 | 856883 | 4.91 | +2.0% |
fwe-val | 4991242 | 1048837 | 4.76 | 1027241 | 4.86 | +2.1% |
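As a rough sketch of how these numbers are obtained, the snippet below computes bytes-per-token ratios using `tiktoken` for the GPT-2 and GPT-4 baselines. The sample text is arbitrary (`scripts.tok_eval` uses its own news/korean/code/math/science documents), the `relative_diff` helper is illustrative, and loading this tokenizer for the "Ours" columns is omitted here.

```python
import tiktoken

def bytes_per_token(text: str, enc) -> tuple[int, float]:
    """Return (token count, bytes-per-token ratio) for one encoding."""
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(enc.encode(text))
    return n_tokens, n_bytes / n_tokens

def relative_diff(baseline_tokens: int, ours_tokens: int) -> float:
    """Positive means 'ours' needs fewer tokens than the baseline."""
    return (baseline_tokens - ours_tokens) / baseline_tokens * 100.0

# Arbitrary sample text, for illustration only.
text = "The quick brown fox jumps over the lazy dog. print('hello world')"

for name, encoding_name in [("GPT-2", "gpt2"), ("GPT-4", "cl100k_base")]:
    enc = tiktoken.get_encoding(encoding_name)
    n_tokens, ratio = bytes_per_token(text, enc)
    print(f"{name}: {n_tokens} tokens, {ratio:.2f} bytes/token")
```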