Nanochat Tokenizer

This is the tokenizer from Andrej Karpathy's educational project nanochat. Training it is the first step of the speedrun.sh script.

Training

First, we need to download the first ~2B characters of the pretraining dataset using the dataset script in nanochat:

export NANOCHAT_BASE_DIR=".cache/nanochat"  # where nanochat caches data and artifacts
mkdir -p $NANOCHAT_BASE_DIR
python -m nanochat.dataset -n 8  # download the first 8 shards of the pretraining data

Then, we can train the tokenizer with a vocab size of 65,536 on these ~2B characters of data:

python -m scripts.tok_train --max_chars=2000000000
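For intuition, the tokenizer is a byte-level BPE model: training starts from the 256 raw byte values and repeatedly merges the most frequent adjacent pair of tokens into a new token until the target vocabulary size is reached. Below is a minimal, illustrative Python sketch of that loop; nanochat's actual trainer is a much faster implementation with additional details (such as pre-splitting the text), so treat this only as a description of the algorithm, not the code that runs here.

from collections import Counter

def train_bpe(text: str, vocab_size: int) -> list:
    """Toy byte-level BPE trainer; returns the learned merges in order."""
    ids = list(text.encode("utf-8"))  # start from raw bytes (token ids 0..255)
    merges = []                       # [((left, right), new_token_id), ...]
    next_id = 256
    while next_id < vocab_size:
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break                     # nothing left to merge
        pair = pair_counts.most_common(1)[0][0]
        merges.append((pair, next_id))
        # replace every occurrence of the pair with the new token id
        new_ids, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                new_ids.append(next_id)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        ids = new_ids
        next_id += 1
    return merges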

And finally, evaluate:

python -m scripts.tok_eval
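The evaluation measures compression as bytes per token on several sample texts and compares the token counts against the GPT-2 and GPT-4 tokenizers; a positive relative diff means our tokenizer needs fewer tokens than the baseline. Here is a minimal sketch of how those two numbers can be computed, using tiktoken for the baselines (encode_ours is a hypothetical stand-in for the trained nanochat tokenizer):

import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")
gpt4 = tiktoken.get_encoding("cl100k_base")  # the GPT-4 base encoding

def compare(text: str, encode_ours) -> None:
    """Print bytes/token and relative token-count diff vs GPT-2 and GPT-4."""
    n_bytes = len(text.encode("utf-8"))
    n_ours = len(encode_ours(text))
    for name, enc in (("GPT-2", gpt2), ("GPT-4", gpt4)):
        n_base = len(enc.encode(text))
        print(f"{name}: baseline ratio {n_bytes / n_base:.2f}, "
              f"ours {n_bytes / n_ours:.2f}, "
              f"relative diff {100 * (n_base - n_ours) / n_base:+.1f}%")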

Tokenizer training

timestamp: 2025-10-14 10:29:05

  • max_chars: 2,000,000,000
  • doc_cap: 10,000
  • vocab_size: 65,536
  • train_time: 52.9085
  • num_special_tokens: 9
  • token_bytes_min: 1
  • token_bytes_max: 32
  • token_bytes_mean: 6.9197
  • token_bytes_std: 2.8748
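The token_bytes_* figures above summarize how long the learned tokens are, in bytes. A minimal sketch of how they could be recomputed, assuming the vocabulary is available as a list of bytes objects (vocab is a hypothetical variable, and whether the reported std is population or sample deviation is an assumption here):

import statistics

def token_byte_stats(vocab: list) -> dict:
    """vocab: list of bytes, one entry per (non-special) token."""
    lengths = [len(token) for token in vocab]
    return {
        "token_bytes_min": min(lengths),
        "token_bytes_max": max(lengths),
        "token_bytes_mean": statistics.mean(lengths),
        "token_bytes_std": statistics.pstdev(lengths),  # population std (assumption)
    }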

Tokenizer evaluation

timestamp: 2025-10-14 10:29:10

Comparison with GPT-2

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
| --- | --- | --- | --- | --- | --- | --- |
| news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
| korean | 893 | 745 | 1.20 | 712 | 1.25 | +4.4% |
| code | 1259 | 576 | 2.19 | 492 | 2.56 | +14.6% |
| math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
| science | 1112 | 260 | 4.28 | 228 | 4.88 | +12.3% |
| fwe-train | 4208518 | 900364 | 4.67 | 856883 | 4.91 | +4.8% |
| fwe-val | 4991242 | 1075364 | 4.64 | 1027241 | 4.86 | +4.5% |

Comparison with GPT-4

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
| --- | --- | --- | --- | --- | --- | --- |
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
| korean | 893 | 364 | 2.45 | 712 | 1.25 | -95.6% |
| code | 1259 | 309 | 4.07 | 492 | 2.56 | -59.2% |
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
| science | 1112 | 249 | 4.47 | 228 | 4.88 | +8.4% |
| fwe-train | 4208518 | 874799 | 4.81 | 856883 | 4.91 | +2.0% |
| fwe-val | 4991242 | 1048837 | 4.76 | 1027241 | 4.86 | +2.1% |