DF-Arc

DF-Arc is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" by combining Morphological Pre-tokenization with PMI-based Phrase Merging.

It achieves near 1:1 fertility (1.16 tokens per word on the benchmark in the Performance section) and high semantic density.
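To make the phrase-merging idea concrete, here is a minimal sketch of scoring adjacent word pairs by pointwise mutual information (PMI). It is not DF-Arc's training code: the function name `pmi_scores`, the `min_count` cutoff, and the toy corpus are all assumptions made for illustration. Pairs that co-occur more often than chance get a high score and become candidates for merging into a single token.

```python
import math
from collections import Counter

def pmi_scores(sentences, min_count=2):
    """Score adjacent word pairs by PMI (hypothetical helper, not DF-Arc's code)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:  # skip rare pairs, whose PMI estimates are unreliable
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

# Toy corpus, already split into words (illustrative only).
corpus = [
    ["الذكاء", "الاصطناعي", "مفيد"],
    ["بحب", "الذكاء", "الاصطناعي"],
    ["بحب", "القهوة"],
    ["القهوة", "مفيد"],
]
scores = pmi_scores(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A real pipeline would then keep pairs above some PMI threshold as merge candidates; the threshold DF-Arc uses is not stated in this card.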

Key Highlights

Architecture: Unigram SentencePiece (compatible with LlamaTokenizer).
Vocab Size: 128,000 tokens.
Baked-in Logic: Rules for morphology (prefixes) and identity (God/Prophet names) are built into the vocabulary. No custom code needed.
Dialect Native: Trained on Egyptian dialogue, songs, and feedback corpora.
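As a rough illustration of what morphological pre-tokenization with protected names can look like, the sketch below splits a leading prefix off a word unless the word is on a protected list. This is a toy under stated assumptions, not DF-Arc's rule set: the prefix list, the protected set, the `min_stem` guard, and the function name `split_prefix` are all hypothetical, and a naive prefix match like this will wrongly split words that merely begin with the same letters.

```python
# Hypothetical rule data; the real DF-Arc rules are not published in this card.
PREFIXES = ("ال", "ب")   # definite article and one preposition
PROTECTED = {"الله"}      # names that must stay whole

def split_prefix(word, sep="_", min_stem=2):
    """Mark a leading prefix with `sep`, leaving protected words untouched."""
    if word in PROTECTED:
        return word
    for prefix in PREFIXES:
        stem = word[len(prefix):]
        if word.startswith(prefix) and len(stem) >= min_stem:
            return prefix + sep + stem
    return word

for w in ["بسم", "الله", "الرحمن", "بحب", "جدا"]:
    print(w, "->", split_prefix(w))
```

Because DF-Arc bakes its rules into the vocabulary, none of this needs to run at inference time; the sketch only shows the kind of segmentation the vocabulary encodes.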

Performance

Tokenizer	Fertility (tokens/word)	Total Tokens	Total Words
DF-Arc	1.162	133,485	114,882
GPT-4 (cl100k)	3.689	423,743	114,882
AraBERT v2	1.555	178,609	114,882
AraT5	1.193	137,107	114,882
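Fertility here is total tokens divided by total words, so the column can be recomputed from the other two. The short sketch below does that with the counts from the table and also derives the relative token saving against a baseline; the helper names `fertility` and `token_reduction` are illustrative, not part of any library.

```python
# (total tokens, total words) copied from the Performance table.
COUNTS = {
    "DF-Arc": (133_485, 114_882),
    "GPT-4 (cl100k)": (423_743, 114_882),
    "AraBERT v2": (178_609, 114_882),
    "AraT5": (137_107, 114_882),
}

def fertility(total_tokens, total_words):
    """Average number of tokens produced per word."""
    return total_tokens / total_words

def token_reduction(name, baseline):
    """Fraction of tokens saved by `name` relative to `baseline` on the same text."""
    return 1 - COUNTS[name][0] / COUNTS[baseline][0]

for name, (tokens, words) in COUNTS.items():
    print(f"{name}: {fertility(tokens, words):.3f} tokens/word")

saving = token_reduction("DF-Arc", "GPT-4 (cl100k)")
print(f"DF-Arc vs GPT-4 (cl100k): {saving:.1%} fewer tokens")
```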

Usage

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub; no custom code is required.
tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"

# Split the sentence into subword tokens.
print(tokenizer.tokenize(text))
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
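The same output can be used to check fertility on a single sentence. The sketch below counts words by splitting on whitespace, which is an assumption: this card does not say how words were counted for the benchmark. The token list is copied from the documented output rather than produced by downloading the tokenizer, and `sentence_fertility` is an illustrative helper name.

```python
def sentence_fertility(tokens, text):
    """Tokens per whitespace-delimited word (word counting rule is an assumption)."""
    return len(tokens) / len(text.split())

text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"

# Tokens copied from the documented output; with the real tokenizer you would
# use tokenizer.tokenize(text) instead.
tokens = ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب',
          'ال_ذكاء_ال_اصطناع_ي', 'جدا']

print(len(tokens), "tokens for", len(text.split()), "words")
print("fertility:", sentence_fertility(tokens, text))
```

On this sentence the comma costs one extra token, while the merged phrase 'ال_ذكاء_ال_اصطناع_ي' covers two words with a single token, so the two effects cancel.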

Citation

@misc{df_arc,
  title={DF-Arc: The Arabic Token Tax \& Morphology-Aware Tokenization},
  author={Dataflare Lab},
  year={2026},
  publisher={Hugging Face}
}
