🔪 Tokenizer Thread

by burtenshaw

The next step in the speedrun is to train a tokenizer. Tokenizers are something a lot of people take for granted, but it's super useful to understand how they work.

The tokenizer uses Rust code in `rustbpe`, exposed to Python via bindings in `nanochat.tokenizer` as `RustBPETokenizer`.
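Before running the commands, it helps to see what BPE training actually does. Here's a toy sketch of the core merge loop in plain Python, just to illustrate the idea; `rustbpe` does this efficiently over ~2B characters with a 65,536-token vocab, and the real implementation differs in the details:

```python
# Toy sketch of byte-pair encoding training: repeatedly merge the most
# frequent adjacent pair into a new token. Illustrative only.
from collections import Counter

def bpe_merges(text: str, num_merges: int):
    seq = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + len(merges)        # new token id above the byte range
        merges.append((pair, new_id))
        # replace every occurrence of the pair with the new token
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_id); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return merges, seq

merges, seq = bpe_merges("low lower lowest", 3)
print(merges)  # e.g. ((108, 111) -> 256), i.e. "lo" becomes one token
```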

```bash
# Install Rust / Cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"

# Build the rustbpe Tokenizer
uv run maturin develop --release --manifest-path rustbpe/Cargo.toml
```

Training requires a slice of 8 shards of the data, which we download first:

```bash
# Download the first ~2B characters of the pretraining dataset
# look at dev/repackage_data_reference.py for details on how this data was prepared
# each data shard is ~250M chars,
# so we download 2e9 / 250e6 = 8 data shards at this point
# each shard is ~100MB of text (compressed), so this is about ~800MB of data on disk
python -m nanochat.dataset -n 8
# Immediately also kick off downloading more shards in the background while the tokenizer trains
# (240 shards is what the rest of the speedrun's pretraining needs)
python -m nanochat.dataset -n 240 &
```
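To sanity-check the shard math in the comments above (plain arithmetic, no nanochat imports needed):

```python
# Shard math from the comments: how many ~250M-char shards cover ~2B chars?
chars_needed = 2_000_000_000           # ~2B chars for tokenizer training
chars_per_shard = 250_000_000          # ~250M chars per shard
shards = chars_needed // chars_per_shard
print(shards)                          # -> 8
print(f"~{shards * 100} MB on disk")   # ~100MB compressed per shard -> ~800 MB
```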

Then we can train and evaluate the tokenizer:

```bash
# train the tokenizer with vocab size 2**16 = 65536 on ~2B characters of data
python -m scripts.tok_train --max_chars=2000000000
# evaluate the tokenizer (report compression ratio etc.)
python -m scripts.tok_eval
```
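Once trained, you can poke at the tokenizer from Python. A minimal sketch, assuming `nanochat.tokenizer` exposes a `get_tokenizer()` helper that loads the trained files from the cache; if your checkout differs, construct `RustBPETokenizer` from the trained artifacts directly:

```python
# Minimal sketch: load the trained tokenizer and round-trip some text.
# get_tokenizer() is an assumed loader name; check nanochat.tokenizer.
from nanochat.tokenizer import get_tokenizer

tokenizer = get_tokenizer()
ids = tokenizer.encode("The next step in the speedrun is to train a tokenizer.")
print(len(ids), ids[:10])
print(tokenizer.decode(ids))  # should round-trip back to the input string
```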

It writes a report into the `.cache` directory like so:

| Text Type | Bytes   | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|---------|--------------|-------------|-------------|------------|-----------------|
| news      | 1819    | 387          | 4.70        | 375         | 4.85       | +3.1%           |
| korean    | 893     | 364          | 2.45        | 712         | 1.25       | -95.6%          |
| code      | 1259    | 309          | 4.07        | 492         | 2.56       | -59.2%          |
| math      | 1834    | 832          | 2.20        | 966         | 1.90       | -16.1%          |
| science   | 1112    | 249          | 4.47        | 228         | 4.88       | +8.4%           |
| fwe-train | 4208518 | 874799       | 4.81        | 856883      | 4.91       | +2.0%           |
| fwe-val   | 4991242 | 1048837      | 4.76        | 1027241     | 4.86       | +2.1%           |
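The ratio columns are bytes per token, so higher means better compression, and the relative diff compares our token count against GPT-4's (Korean and code compress worse, unsurprisingly, since the ~2B training characters are mostly English web text). A quick arithmetic check on the news row, separate from what `tok_eval` itself computes:

```python
# "Ratio" = bytes per token; "Relative Diff %" compares token counts vs GPT-4
bytes_, gpt4_tokens, ours_tokens = 1819, 387, 375  # the "news" row
print(round(bytes_ / ours_tokens, 2))                       # -> 4.85 (Ours Ratio)
print(f"{(gpt4_tokens - ours_tokens) / gpt4_tokens:+.1%}")  # -> +3.1% (Relative Diff)
```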

I also published the tokenizer at https://huggingface.co/nanochat-students/nanochat-tokenizer-2B

There's also a notebook to try it out: https://huggingface.co/nanochat-students/nanochat-tokenizer-2B/blob/main/tokenizer.ipynb
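If you just want to play with the published tokenizer without building nanochat, you can probably load it with the `tokenizers` library. A sketch, assuming the repo ships a standard `tokenizer.json` (the notebook above shows the exact usage):

```python
# Load the published tokenizer from the Hub (assumes a standard tokenizer.json)
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("nanochat-students/nanochat-tokenizer-2B")
enc = tok.encode("The next step in the speedrun is to train a tokenizer.")
print(enc.tokens[:10])  # the string pieces
print(enc.ids[:10])     # the corresponding token ids
```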

