DF-Arc

DF-Arc is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" by combining Morphological Pre-tokenization with PMI-based Phrase Merging.
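To make the phrase-merging idea concrete, here is a minimal sketch of PMI-based merging over pre-split words. It is an illustration only, not DF-Arc's actual training code: the function names, the `min_count` and `threshold` parameters, and the `_` joiner are all assumptions.

```python
import math
from collections import Counter

def pmi_scores(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2).

    `sentences` is an iterable of word lists. Pairs seen fewer than
    `min_count` times are skipped (hypothetical cutoff).
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

def merge_phrases(words, scores, threshold):
    """Greedily join adjacent words whose PMI exceeds `threshold`."""
    merged = []
    i = 0
    while i < len(words):
        pair = (words[i], words[i + 1]) if i + 1 < len(words) else None
        if pair is not None and scores.get(pair, float("-inf")) > threshold:
            merged.append(words[i] + "_" + words[i + 1])  # assumed joiner
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged
```

Pairs that co-occur far more often than their individual frequencies predict get a high score and become single merge candidates, which is how a frequent phrase can end up as one vocabulary entry.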

It achieves a fertility of 1.26 tokens per word, close to the ideal 1:1 ratio, so each token carries more meaning (high semantic density).
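Fertility is the number of tokens produced divided by the number of words in the input. The sketch below shows one way to measure it. It assumes words are counted by whitespace splitting, which this card does not specify, and `fertility` is a hypothetical helper name.

```python
def fertility(texts, tokenize):
    """Return tokens per word over `texts`.

    `tokenize` is any callable mapping a string to a list of tokens.
    Words are counted by whitespace splitting (an assumption).
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words
```

Passing `tokenizer.tokenize` from a loaded Hugging Face tokenizer as `tokenize` should give a comparable figure on your own corpus, although the exact value depends on how words are counted.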

Key Highlights

  • Architecture: Unigram SentencePiece (compatible with LlamaTokenizer).
  • Vocab Size: 64,000 tokens.
  • Baked-in Logic: Rules for morphology (prefix splitting) and identity (keeping names of God and the Prophet intact) are built into the vocabulary, so no custom code is needed.
  • Dialect Native: Trained on Egyptian dialogue, songs, and feedback corpora.
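As a rough illustration of what the morphology and identity rules might look like before they are baked into the vocabulary, the toy sketch below splits a prefix off a word unless the word is protected. The prefix list, the protected set, the minimum stem length, and the `_` marker are all hypothetical. DF-Arc's real rules are not documented here.

```python
PREFIXES = ("ال", "ب")   # hypothetical prefix list: definite article, preposition "bi"
PROTECTED = {"الله"}      # hypothetical protected words that are never split

def split_prefix(word, min_stem=2):
    """Mark a leading prefix with '_' unless the word is protected."""
    if word in PROTECTED:
        return word
    for prefix in PREFIXES:
        stem = word[len(prefix):]
        # Only split when enough of the word remains to form a stem.
        if word.startswith(prefix) and len(stem) >= min_stem:
            return prefix + "_" + stem
    return word
```

A string rule like this over-splits words whose first letter only looks like a prefix, so a real system would need a lexicon or a morphological analyzer behind it.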

Performance

Model            Fertility   Total Tokens   Total Words
DF-Arc           1.260       144,734        114,882
GPT-4 (cl100k)   3.689       423,743        114,882
AraBERT v2       1.555       178,609        114,882
AraT5            1.193       137,107        114,882
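The fertility column is total tokens divided by total words. The short sketch below recomputes it from the totals in the table and also expresses each tokenizer's token count relative to DF-Arc. The helper names are illustrative.

```python
TOTAL_WORDS = 114_882
TOTAL_TOKENS = {
    "DF-Arc": 144_734,
    "GPT-4 (cl100k)": 423_743,
    "AraBERT v2": 178_609,
    "AraT5": 137_107,
}

def fertility_table(total_tokens, total_words):
    """Tokens per word for each tokenizer."""
    return {name: n / total_words for name, n in total_tokens.items()}

def relative_cost(total_tokens, baseline="DF-Arc"):
    """Token count of each tokenizer as a multiple of the baseline's."""
    base = total_tokens[baseline]
    return {name: n / base for name, n in total_tokens.items()}

for name, value in fertility_table(TOTAL_TOKENS, TOTAL_WORDS).items():
    print(f"{name:15s} {value:.3f}")
```

On these totals, GPT-4's cl100k encoding needs a little under three times as many tokens as DF-Arc for the same text.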

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"

print(tokenizer.tokenize(text))
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']

Citation

@misc{df_arc,
  title={DF-Arc: The Arabic Token Tax \& Morphology-Aware Tokenization},
  author={Dataflare Lab},
  year={2026},
  publisher={Hugging Face}
}