---
language:
- am
- ti
license: mit
tags:
- tokenizer
- byte-pair-encoding
- bpe
- geez-script
- amharic
- tigrinya
- low-resource
- nlp
- morphology-aware
- horn-of-africa
datasets:
- HornMT
library_name: transformers
pipeline_tag: token-classification
widget:
- text: "ሰላም ዓለም!"
model-index:
- name: Geez BPE Tokenizer
  results: []
---

# Geez Tokenizer (`Hailay/geez-tokenizer`)

A **BPE tokenizer** trained specifically for **Geez-script languages**, including **Tigrinya** and **Amharic**. It is trained on monolingual corpora and supports morphologically rich, low-resource languages.

## Motivation

Byte-Pair Encoding (BPE) tokenizers trained on English or other Latin-script languages often tokenize Geez-script languages inefficiently. This tokenizer aims to:

- Reduce over-segmentation errors
- Respect morpheme boundaries
- Improve language understanding for downstream tasks such as Machine Translation and QA

## Training Details

- **Tokenizer Type**: BPE
- **Vocabulary Size**: 32,000
- **Pre-tokenizer**: Whitespace
- **Normalizer**: NFD → Lowercase → StripAccents
- **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Post-processing**: Template for `[CLS] $A [SEP]` and `[CLS] $A [SEP] $B [SEP]`
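
For reference, this configuration can be reproduced with the `tokenizers` library. The following is a minimal sketch rather than the exact training script; `corpus.txt` stands in for a monolingual Geez-script corpus:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers, processors

# BPE model with the unknown token declared up front.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# NFD -> Lowercase -> StripAccents, as listed above.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.Lowercase(),
    normalizers.StripAccents(),
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(["corpus.txt"], trainer)  # placeholder corpus path

# Post-processing templates for single sentences and sentence pairs.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("tokenizer.json")
```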

## Files

- `vocab.json`: Vocabulary file
- `merges.txt`: BPE merge rules
- `tokenizer.json`: Full tokenizer configuration
- `tokenizer_config.json`: Hugging Face-compatible configuration
- `special_tokens_map.json`: Special token mappings

## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

text = "ሰላም ዓለም!"  # "Hello, world!" in Geez script
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print("Tokens:", tokens)
print("Token IDs:", ids)
```
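
Because of the post-processing template, `encode` should wrap inputs in `[CLS]` and `[SEP]` automatically. A quick way to check:

```python
# Inspect the special-token template applied by encode().
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("ሰላም ዓለም!")))
# Expected to start with '[CLS]' and end with '[SEP]'.
```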

## Intended Use

This tokenizer is best suited for:

- Low-resource NLP pipelines
- Machine Translation
- Question Answering
- Named Entity Recognition
- Morphological analysis

## Limitations

- It is optimized for Geez-script languages and may not generalize to others.
- Some compound verbs and morphologically fused words may still require linguistic preprocessing.
- It is currently monolingual for Amharic and Tigrinya and does not support multilingual code-switching.

## Evaluation

The tokenizer was evaluated manually on:

- Token coverage of Tigrinya/Amharic corpora
- Morphological preservation
- Reduction of BPE segmentation errors

Quantitative metrics will be published in an accompanying paper; a sketch of how such checks can be run follows below.
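
As a stand-in until those numbers are available, this minimal sketch measures two common proxies for the criteria above: subword fertility (average subwords per whitespace-separated word) and the `[UNK]` rate. `eval_corpus.txt` is a placeholder for held-out Tigrinya/Amharic text:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")
unk_id = tokenizer.convert_tokens_to_ids("[UNK]")

n_words = n_subwords = n_unk = 0
with open("eval_corpus.txt", encoding="utf-8") as f:  # placeholder corpus path
    for line in f:
        for word in line.split():
            ids = tokenizer.encode(word, add_special_tokens=False)
            n_words += 1
            n_subwords += len(ids)
            n_unk += sum(1 for i in ids if i == unk_id)

print(f"Fertility (subwords per word): {n_subwords / max(n_words, 1):.2f}")
print(f"[UNK] rate: {n_unk / max(n_subwords, 1):.4%}")
```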

## License

This tokenizer is released under the MIT License.

## Citation

```bibtex
@misc{hailay2025geez,
  title={Geʽez Script Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
  author={Teklehaymanot, Hailay},
  year={2025},
  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
}
```