---
language:
- am
- ti
license: mit
tags:
- tokenizer
- byte-pair-encoding
- bpe
- geez-script
- amharic
- tigrinya
- low-resource
- nlp
- morphology-aware
- horn-of-africa
datasets:
- HornMT
library_name: transformers
pipeline_tag: token-classification
widget:
- text: "!"
model-index:
- name: Geez BPE Tokenizer
results: []
---
# Geez Tokenizer (`Hailay/geez-tokenizer`)
A **BPE tokenizer** trained specifically for **Geez-script languages**, including **Tigrinya** and **Amharic**. It is trained on monolingual corpora in these languages and is designed for morphologically rich, low-resource settings.
## 🧠 Motivation
Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently. This tokenizer aims to:
- Reduce over-segmentation errors
- Respect morpheme boundaries
- Improve language understanding for downstream tasks like Machine Translation and QA
## 📚 Training Details
- **Tokenizer Type**: BPE
- **Vocabulary Size**: 32,000
- **Pre-tokenizer**: Whitespace
- **Normalizer**: NFD → Lowercase → StripAccents
- **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Post-processing**: Template for `[CLS] $A [SEP]` and `[CLS] $A [SEP] $B [SEP]`
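
A minimal sketch of how this configuration maps onto the Hugging Face `tokenizers` library (the corpus file names are placeholders; this illustrates the settings listed above, not the exact training script):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors, trainers

special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# BPE model with the normalization pipeline listed above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train on monolingual corpora (placeholder file names).
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=special_tokens)
tokenizer.train(["amharic.txt", "tigrinya.txt"], trainer=trainer)

# Post-processing templates for single sentences and sentence pairs.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("tokenizer.json")
```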
## 📁 Files
- `vocab.json`: Vocabulary file
- `merges.txt`: Merge rules for BPE
- `tokenizer.json`: Full tokenizer config
- `tokenizer_config.json`: Hugging Face-compatible configuration
- `special_tokens_map.json`: Maps for special tokens
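
The serialized `tokenizer.json` is self-contained, so it can also be loaded with the standalone `tokenizers` library, without `transformers` (assuming the file has been downloaded locally):

```python
from tokenizers import Tokenizer

# Load the full tokenizer directly from the serialized config.
tok = Tokenizer.from_file("tokenizer.json")
print(tok.encode("ውስጥ").tokens)
```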
## 🚀 Usage
```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")
text = "የግብፅ አርኪኦሎጂስቶች በሳኩራ ኔክሮፖሊስ ውስጥ የተገኘውን ትልቁን መቃብር አግኝተዋል።"
tokens = tokenizer.tokenize(text)  # subword tokens
ids = tokenizer.encode(text)       # token IDs, with special tokens added by the post-processor
print("Tokens:", tokens)
print("Token IDs:", ids)
```
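
Continuing the snippet above, `decode` maps IDs back to text; `skip_special_tokens=True` drops the `[CLS]`/`[SEP]` tokens that `encode` inserts. Note that the reconstruction may differ in whitespace depending on the configured decoder:

```python
# Round-trip the IDs back to a string, without the special tokens.
print(tokenizer.decode(ids, skip_special_tokens=True))
```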
## 📊 Intended Use
This tokenizer is best suited for:
- Low-resource NLP pipelines
- Machine Translation
- Question Answering
- Named Entity Recognition
- Morphological analysis
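
As a sketch of how the tokenizer slots into such a pipeline, it can batch and pad inputs for a downstream model (assuming PyTorch is installed and the `[PAD]` token from `special_tokens_map.json` is registered):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

# Pad and truncate a small batch for a downstream model (e.g., NER or QA).
batch = tokenizer(
    ["የተገኘውን ትልቁን መቃብር", "ውስጥ"],
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["attention_mask"].shape)
```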
## ✋ Limitations
- It is optimized for Geez-script languages and might not generalize to others.
- Some compound verbs and morphologically fused words may still require linguistic preprocessing.
- It currently covers only Amharic and Tigrinya and does not support multilingual code-switching.
## ✅ Evaluation
The tokenizer was evaluated manually on:
- Token coverage of Tigrinya/Amharic corpora
- Morphological preservation
- Reduction of BPE segmentation errors

Quantitative metrics are to be published in an accompanying paper.
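
A rough illustration of the token-coverage check (an assumed proxy, not the published protocol): average subwords per whitespace word (fertility) and the `[UNK]` rate over a corpus. Lower fertility and a near-zero `[UNK]` rate indicate better coverage.

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

def coverage_stats(sentences):
    """Return (fertility, unk_rate) over a list of sentences."""
    n_words = n_tokens = n_unk = 0
    for sentence in sentences:
        tokens = tokenizer.tokenize(sentence)
        n_words += len(sentence.split())
        n_tokens += len(tokens)
        n_unk += tokens.count("[UNK]")
    return n_tokens / n_words, n_unk / n_tokens

fertility, unk_rate = coverage_stats(["ትልቁን መቃብር አግኝተዋል።"])  # toy corpus
print(f"fertility={fertility:.2f}, unk_rate={unk_rate:.2%}")
```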
## 📜 License
This tokenizer is licensed under the MIT License.
## 📌 Citation
```bibtex
@misc{hailay2025geez,
  title={Ge'ez Script Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
  author={Teklehaymanot, Hailay},
  year={2025},
  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
}
```