🔪Tokenizer Thread #3, opened by burtenshaw
The next step in the speedrun is to train a tokenizer. This is something a lot of people take for granted, but it's super useful to understand.
The tokenizer uses Rust code in `rustbpe` and Python bindings in `nanochat.tokenizer` as `RustBPETokenizer`.
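For intuition, BPE training just repeatedly merges the most frequent adjacent pair of tokens until the vocabulary is full. Here's a toy pure-Python version of that loop (illustrative only; the real, fast implementation lives in the Rust crate, which we build next):

```python
from collections import Counter

def toy_bpe(text: str, num_merges: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair.
    Illustrative only; rustbpe implements the real (fast) version."""
    tokens = list(text.encode("utf-8"))  # start from raw bytes, GPT-style
    merges = {}
    next_id = 256  # byte vocab is 0..255; merged tokens get new ids after that
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges[pair] = next_id
        # replace every occurrence of the pair with the new token id
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
        next_id += 1
    return tokens, merges

tokens, merges = toy_bpe("aaabdaaabac", 3)
print(tokens)  # the sequence shrinks as frequent pairs get merged
```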
```bash
# Install Rust / Cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"

# Build the rustbpe tokenizer
uv run maturin develop --release --manifest-path rustbpe/Cargo.toml
```
Training requires a slice of 8 shards of the data, which we download first:
```bash
# Download the first ~2B characters of the pretraining dataset
# look at dev/repackage_data_reference.py for details on how this data was prepared
# each data shard is ~250M chars
# so we download 2e9 / 250e6 = 8 data shards at this point
# each shard is ~100MB of text (compressed), so this is about ~800MB of data on disk
python -m nanochat.dataset -n 8
# Immediately also kick off downloading more shards in the background while the tokenizer trains
# (see the comment later in the speedrun for why 240 is the right number here)
python -m nanochat.dataset -n 240 &
```
Then we can train and evaluate the tokenizer:
```bash
# train the tokenizer with vocab size 2**16 = 65536 on ~2B characters of data
python -m scripts.tok_train --max_chars=2000000000
# evaluate the tokenizer (report compression ratio etc.)
python -m scripts.tok_eval
```
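After training, you can load the tokenizer back into Python for a quick sanity check. Here's a minimal round-trip sketch; note that `get_tokenizer`, `encode`, and `decode` are my assumptions about the `nanochat.tokenizer` API, so check the repo for the exact names:

```python
# Round-trip a string through the trained tokenizer.
# NOTE: get_tokenizer / encode / decode are assumed names for the
# nanochat.tokenizer API; check the repo for the exact interface.
from nanochat.tokenizer import get_tokenizer

tok = get_tokenizer()  # loads the trained RustBPETokenizer
text = "The next step in the speedrun is to train a tokenizer."
ids = tok.encode(text)
print(len(text), "chars ->", len(ids), "tokens")
assert tok.decode(ids) == text  # lossless round trip
```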
`tok_eval` writes a report into the `.cache` directory, like so:
| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|---|---|---|---|---|---|---|
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
| korean | 893 | 364 | 2.45 | 712 | 1.25 | -95.6% |
| code | 1259 | 309 | 4.07 | 492 | 2.56 | -59.2% |
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
| science | 1112 | 249 | 4.47 | 228 | 4.88 | +8.4% |
| fwe-train | 4208518 | 874799 | 4.81 | 856883 | 4.91 | +2.0% |
| fwe-val | 4991242 | 1048837 | 4.76 | 1027241 | 4.86 | +2.1% |
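For reference, the ratio columns are just bytes per token (higher means better compression), and the table's numbers are consistent with the relative diff being computed on token counts against GPT-4. Reproducing the news row:

```python
# Reproduce the "news" row of the report.
n_bytes, gpt4_tokens, ours_tokens = 1819, 387, 375

gpt4_ratio = n_bytes / gpt4_tokens  # 4.70 bytes per token
ours_ratio = n_bytes / ours_tokens  # 4.85 bytes per token
rel_diff = (gpt4_tokens - ours_tokens) / gpt4_tokens * 100  # +3.1%

print(f"GPT-4 {gpt4_ratio:.2f} | ours {ours_ratio:.2f} | diff {rel_diff:+.1f}%")
```

The large negative diffs on Korean and code are what you'd expect from a 65536-token vocab trained on ~2B characters of (presumably mostly English) web text, compared against GPT-4's larger vocabulary.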
I also published the tokenizer at https://huggingface.co/nanochat-students/nanochat-tokenizer-2B, along with a notebook to try it out: https://huggingface.co/nanochat-students/nanochat-tokenizer-2B/blob/main/tokenizer.ipynb