---
license: mit
base_model:
- meta-llama/Llama-3.2-3B-Instruct
---

# Llama3-2-3B-IT-Byte 🔢

__[Llama3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) transferred to byte-level tokenization via [cross-tokenizer distillation](https://arxiv.org/abs/2503.20083).__

__🚧 This model is intended as a proof of concept that we can quickly and effectively transfer pretrained (subword-based) models to the byte level. It is not optimized for production use (in particular, it is not optimized for speed)! 🚧__
## Benchmarks

Llama3-2-3B-IT-Byte performs competitively even though it was trained on only 1.3B bytes (≈328M subword tokens in total).

| | MMLU | BoolQ | PiQA | IFEval | ARC-C | Avg. |
|-----------------------------------|------|-------|-------|--------|-------|------|
| [EvaByte-6.5B-SFT](https://huggingface.co/EvaByte/EvaByte-SFT) | 49.5 | 79.5* | 74.1* | 60.2 | 64.6* | 65.6 |
| [Llama3.2-3B-Instruct (original)](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | 62.4 | 78.8 | 76.9 | 76.6 | 43.9 | 67.7 |
| [Gemma2-2B-IT (original)](https://huggingface.co/google/gemma-2-2b-it) | 56.9 | 83.8 | 79.6 | 62.5 | 50.4 | 66.6 |
| __Llama3-2-3B-IT-Byte (this model)__ | __57.0__ | __76.6__ | __73.6__ | __58.8__ | __39.8__ | __61.2__ |
| [Gemma2-2B-IT-Byte](https://huggingface.co/benjamin/Gemma2-2B-IT-Byte) | 51.0 | 80.5 | 71.5 | 51.9 | 38.2 | 58.6 |

<small>*Numbers from EvaByte-6.5B (Base) since they are not reported for the SFT model.</small>
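
For reference, the Avg. column is consistent with the unweighted mean of the five task scores. A minimal check in plain Python (not part of the released code):

```python
# "Avg." matches the unweighted mean over the five tasks,
# e.g. for Llama3-2-3B-IT-Byte (this model):
scores = [57.0, 76.6, 73.6, 58.8, 39.8]  # MMLU, BoolQ, PiQA, IFEval, ARC-C
print(round(sum(scores) / len(scores), 1))  # 61.2
```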
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("benjamin/Llama3-2-3B-IT-Byte")
print("Vocab Size:", len(tokenizer))  # 256 bytes + some special tokens

device = "cuda"
model = AutoModelForCausalLM.from_pretrained(
    "benjamin/Llama3-2-3B-IT-Byte", trust_remote_code=True
)
model = model.to(device)

tokens = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello, how are you doing?"}], return_tensors="pt"
)
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
out = model.generate(tokens.to(model.device), eos_token_id=eot_id)
print(tokenizer.decode(out[0]))
```
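As a quick sanity check of the byte-level tokenizer, you can compare the token count of a string with its UTF-8 byte count. A minimal sketch (the exact count may differ by a few special tokens depending on settings):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("benjamin/Llama3-2-3B-IT-Byte")

# With byte-level tokenization, the number of tokens should track the number
# of UTF-8 bytes (plus any special tokens the tokenizer adds).
text = "Hello, how are you doing?"
ids = tokenizer(text, add_special_tokens=False)["input_ids"]
print(len(text.encode("utf-8")), len(ids))  # expected to (roughly) match
```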
## Training

This model has been trained using [`tokenkit`](https://github.com/bminixhofer/tokenkit) with the following command:
```
python3 scripts/cross_tokenizer_distill.py \
--config=configs/cross_tokenizer_distill.yaml \
--overrides \
losses=[sft,alm_unconstrained,alm_latents] \
multitask_aggregation_fn=approx_gradmag_preserve_mag \
alm_mode=merge_by_space_prob+append_space \
tokenizer_pair_bias_threshold=0.1 \
max_student_length=2048 \
steps=20000 \
eval_interval=20000 \
save_interval=20000 \
optimizer.learning_rate=3.e-5 \
optimizer.weight_decay=0.0 \
optimizer.max_grad_norm=null \
optimizer.grad_acc_steps=1 \
train_model_mode=full \
expand_input_ids=true \
output_embeddings_mode=untie \
eval.tasks=[arc_easy,arc_challenge,piqa,boolq,arithmetic,mmlu,ifeval,agieval_en,agieval_cn] \
data.batch_size=32 \
student.pretrained_model_name_or_path=benjamin/Llama-3.2-3B-Instruct-flax \
student.tokenizer_name=meta-llama/Llama-3.2-3B-Instruct:source=Llama3 \
target_tokenizer_name=meta-llama/Llama-3.2-3B-Instruct:source=Llama3:target=Llama3:conversion=byte \
n_model_parallel=4 \
n_data_parallel=4 \
data.num_workers=16 \
num_workers=16 \
name=llama3_to_byte_20k
```
Training took ~26 hours on a TPU v4-32.
## Future Work

The current version of this model was trained for 20k steps with 32×2048 bytes per batch (≈1.3B bytes, or roughly 328M subword tokens in total). It was unexpected that it performs as well as it does after such a short training run. We plan to train a new version for more steps (you can also do so yourself using [`tokenkit`](https://github.com/bminixhofer/tokenkit)).
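
To make the byte budget concrete, here is the arithmetic implied by the training command above (the ≈4 bytes-per-subword-token ratio is inferred from the reported figures, not measured here):

```python
steps = 20_000         # steps=20000
batch_size = 32        # data.batch_size=32
seq_len_bytes = 2_048  # max_student_length=2048

total_bytes = steps * batch_size * seq_len_bytes
print(f"{total_bytes:,}")    # 1,310,720,000 ≈ 1.3B bytes
print(total_bytes / 328e6)   # ≈ 4.0 bytes per original subword token
```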
To preserve efficiency, we would have to add (a combination of) [BLT-style hierarchical processing](https://arxiv.org/abs/2412.09871), [attention approximations](https://hkunlp.github.io/blog/2025/evabyte/), and [self-speculative decoding](https://arxiv.org/abs/2309.08168).
## Acknowledgments

Training was enabled by Cloud TPUs from Google’s TPU Research Cloud (TRC).