TokSuite

community

AI & ML interests

Tokenization, Robustness, LLMs

Recent Activity

gsaltintas updated a model about 22 hours ago

toksuite/meta-llama-Llama-3.2-7B

gsaltintas updated a model about 22 hours ago

toksuite/meta-llama-Llama-3.2-300M

gsaltintas updated a model about 22 hours ago

toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_888

Papers

TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Organization Card

TokSuite is a collection of models and benchmarks designed to isolate and study the impact of tokenization on language model behavior. It covers five languages (English, Chinese, Turkish, Italian, and Farsi) as well as STEM and mathematical text. The collection includes fourteen models that share the same architecture, training data, training budget, and initialization and differ only in their tokenizers, alongside a set of benchmarks that evaluate performance under real-world perturbations that affect tokenization.

Our code is available at https://github.com/r-three/Tokenizers.
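
To make the idea of a "perturbation that affects tokenization" concrete, here is a minimal, self-contained sketch. It is not taken from the TokSuite code base: the `greedy_tokenize` helper and the `TOY_VOCAB` vocabulary are hypothetical, and the greedy longest-match rule stands in for a real subword tokenizer. It only illustrates how a one-character typo can change the token sequence a model sees.

```python
def greedy_tokenize(text, vocab):
    """Toy greedy longest-match tokenizer (hypothetical, for illustration only).

    At each position, take the longest vocabulary piece that matches;
    fall back to a single character when nothing in the vocabulary matches.
    """
    tokens = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Assumed toy vocabulary; real tokenizers have tens of thousands of entries.
TOY_VOCAB = {"token", "ization", "izer", "the", " "}

clean = "tokenization"
perturbed = "tokenizaton"  # same word with one character dropped

clean_tokens = greedy_tokenize(clean, TOY_VOCAB)
perturbed_tokens = greedy_tokenize(perturbed, TOY_VOCAB)

print(clean_tokens)      # ['token', 'ization']
print(perturbed_tokens)  # ['token', 'i', 'z', 'a', 't', 'o', 'n']
```

In this toy setting the typo turns a two-token word into seven tokens. Tokenizers with different vocabularies would fragment the same perturbed input differently, which is the kind of variation a controlled comparison across tokenizers is meant to expose.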

Collections 4

spaces 3

Quick Tokenizer Accuracy

Evaluate models on multiple-choice questions

Tokenizer Comparison

Compare how tokenizers split text into tokens

models 22

toksuite/meta-llama-Llama-3.2-7B

Text Generation • 8B • Updated about 22 hours ago • 112

toksuite/meta-llama-Llama-3.2-300M

Text Generation • 0.6B • Updated about 22 hours ago • 98

toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_222

Text Generation • 2B • Updated about 22 hours ago • 113

toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_888

Text Generation • 2B • Updated about 22 hours ago • 94

toksuite/google-gemma-2-2b

Text Generation • 2B • Updated Dec 25, 2025 • 604

toksuite/meta-llama-Llama-3.2-1B

Text Generation • 2B • Updated Dec 25, 2025 • 23

toksuite/CohereLabs-aya-expanse-8b

Text Generation • 2B • Updated Dec 25, 2025 • 14

toksuite/tiktoken-gpt-4o

Text Generation • 2B • Updated Dec 25, 2025 • 16

toksuite/common-pile-comma-v0.1

Text Generation • 2B • Updated Dec 25, 2025 • 19

toksuite/microsoft-Phi-3-mini-4k-instruct

Text Generation • 1B • Updated Dec 25, 2025 • 27

datasets 10

toksuite/toksuite_chinese

Viewer • Updated Jan 21 • 485 • 290

toksuite/toksuite_turkish

Viewer • Updated Jan 21 • 621 • 120

toksuite/toksuite_farsi

Viewer • Updated Jan 20 • 747 • 120

toksuite/toksuite_math

Viewer • Updated Jan 20 • 189 • 133

toksuite/toksuite_english

Viewer • Updated Jan 20 • 1.14k • 281

toksuite/toksuite_italian

Viewer • Updated Jan 20 • 1.09k • 146

toksuite/toksuite_stem

Viewer • Updated Jan 20 • 613 • 140

toksuite/toksuite_general

Viewer • Updated Jan 20 • 68 • 37

toksuite/toksuite_pretraining_data

Viewer • Updated Dec 18, 2025 • 107M • 1.27k

toksuite/Qwen-Qwen3-8B-toksuite-detokenized

Viewer • Updated Dec 18, 2025 • 28M • 34