🔪 Tokenizer Thread

by burtenshaw

The next step in the speedrun is to train a tokenizer. Tokenizers are something a lot of people take for granted, but it's super useful to understand how they work.

The tokenizer uses Rust code in `rustbpe`, exposed to Python via bindings in `nanochat.tokenizer` as `RustBPETokenizer`.
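Before running the commands, it helps to see what BPE training actually does. Here's a toy sketch of the core merge loop in plain Python, just to illustrate the idea; `rustbpe` does this efficiently over ~2B characters with a 65,536-token vocab, and the real implementation differs in the details:

```python
# Toy sketch of byte-pair encoding training: repeatedly merge the most
# frequent adjacent pair into a new token. Illustrative only.
from collections import Counter

def bpe_merges(text: str, num_merges: int):
    seq = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + len(merges)        # new token id above the byte range
        merges.append((pair, new_id))
        # replace every occurrence of the pair with the new token
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_id); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return merges, seq

merges, seq = bpe_merges("low lower lowest", 3)
print(merges)  # e.g. ((108, 111) -> 256), i.e. "lo" becomes one token
```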

```bash
# Install Rust / Cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"

# Build the rustbpe Tokenizer
uv run maturin develop --release --manifest-path rustbpe/Cargo.toml
```

Training requires a slice of 8 shards of the data, which we download first:

```bash
# Download the first ~2B characters of the pretraining dataset
# look at dev/repackage_data_reference.py for details on how this data was prepared
# each data shard is ~250M chars,
# so we download 2e9 / 250e6 = 8 data shards at this point
# each shard is ~100MB of text (compressed), so this is about ~800MB of data on disk
python -m nanochat.dataset -n 8
# Immediately also kick off downloading more shards in the background while the tokenizer trains
# (240 shards is what the rest of the speedrun's pretraining needs)
python -m nanochat.dataset -n 240 &
```
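To sanity-check the shard math in the comments above (plain arithmetic, no nanochat imports needed):

```python
# Shard math from the comments: how many ~250M-char shards cover ~2B chars?
chars_needed = 2_000_000_000           # ~2B chars for tokenizer training
chars_per_shard = 250_000_000          # ~250M chars per shard
shards = chars_needed // chars_per_shard
print(shards)                          # -> 8
print(f"~{shards * 100} MB on disk")   # ~100MB compressed per shard -> ~800 MB
```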

Then we can train and evaluate the tokenizer:

```bash
# train the tokenizer with vocab size 2**16 = 65536 on ~2B characters of data
python -m scripts.tok_train --max_chars=2000000000
# evaluate the tokenizer (report compression ratio etc.)
python -m scripts.tok_eval
```
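Once trained, you can poke at the tokenizer from Python. A minimal sketch, assuming `nanochat.tokenizer` exposes a `get_tokenizer()` helper that loads the trained files from the cache; if your checkout differs, construct `RustBPETokenizer` from the trained artifacts directly:

```python
# Minimal sketch: load the trained tokenizer and round-trip some text.
# get_tokenizer() is an assumed loader name; check nanochat.tokenizer.
from nanochat.tokenizer import get_tokenizer

tokenizer = get_tokenizer()
ids = tokenizer.encode("The next step in the speedrun is to train a tokenizer.")
print(len(ids), ids[:10])
print(tokenizer.decode(ids))  # should round-trip back to the input string
```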

It writes a report into the `.cache` directory like so:

| Text Type | Bytes   | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|---------|--------------|-------------|-------------|------------|-----------------|
| news      | 1819    | 387          | 4.70        | 375         | 4.85       | +3.1%           |
| korean    | 893     | 364          | 2.45        | 712         | 1.25       | -95.6%          |
| code      | 1259    | 309          | 4.07        | 492         | 2.56       | -59.2%          |
| math      | 1834    | 832          | 2.20        | 966         | 1.90       | -16.1%          |
| science   | 1112    | 249          | 4.47        | 228         | 4.88       | +8.4%           |
| fwe-train | 4208518 | 874799       | 4.81        | 856883      | 4.91       | +2.0%           |
| fwe-val   | 4991242 | 1048837      | 4.76        | 1027241     | 4.86       | +2.1%           |
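The ratio columns are bytes per token, so higher means better compression, and the relative diff compares our token count against GPT-4's (Korean and code compress worse, unsurprisingly, since the ~2B training characters are mostly English web text). A quick arithmetic check on the news row, separate from what `tok_eval` itself computes:

```python
# "Ratio" = bytes per token; "Relative Diff %" compares token counts vs GPT-4
bytes_, gpt4_tokens, ours_tokens = 1819, 387, 375  # the "news" row
print(round(bytes_ / ours_tokens, 2))                       # -> 4.85 (Ours Ratio)
print(f"{(gpt4_tokens - ours_tokens) / gpt4_tokens:+.1%}")  # -> +3.1% (Relative Diff)
```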

I also published the tokenizer at https://huggingface.co/nanochat-students/nanochat-tokenizer-2B

There's also a notebook to try it out: https://huggingface.co/nanochat-students/nanochat-tokenizer-2B/blob/main/tokenizer.ipynb
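If you just want to play with the published tokenizer without building nanochat, you can probably load it with the `tokenizers` library. A sketch, assuming the repo ships a standard `tokenizer.json` (the notebook above shows the exact usage):

```python
# Load the published tokenizer from the Hub (assumes a standard tokenizer.json)
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("nanochat-students/nanochat-tokenizer-2B")
enc = tok.encode("The next step in the speedrun is to train a tokenizer.")
print(enc.tokens[:10])  # the string pieces
print(enc.ids[:10])     # the corresponding token ids
```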

