---
language:
- am
- ti
license: mit
tags:
- tokenizer
- byte-pair-encoding
- bpe
- geez-script
- amharic
- tigrinya
- low-resource
- nlp
- morphology-aware
- horn-of-africa
datasets:
- HornMT
library_name: transformers
pipeline_tag: token-classification
widget:
- text: "ሰላም ዓለም!"
model-index:
- name: Geez BPE Tokenizer
  results: []
---

# Geez Tokenizer (`Hailay/geez-tokenizer`)

A **BPE tokenizer** trained specifically for **Geez-script languages**, including **Tigrinya** and **Amharic**. It is trained on monolingual corpora and supports morphologically rich, low-resource languages.

## Motivation

Byte-Pair Encoding (BPE) tokenizers trained on English or other Latin-script languages often tokenize Geez-script languages inefficiently. This tokenizer aims to:

- Reduce over-segmentation errors
- Respect morpheme boundaries
- Improve language understanding for downstream tasks such as Machine Translation and QA

## Training Details

- **Tokenizer Type**: BPE
- **Vocabulary Size**: 32,000
- **Pre-tokenizer**: Whitespace
- **Normalizer**: NFD → Lowercase → StripAccents
- **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Post-processing**: Template for `[CLS] $A [SEP]` and `[CLS] $A [SEP] $B [SEP]`
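
For reference, this configuration can be reproduced with the `tokenizers` library. The following is a minimal sketch rather than the exact training script; `corpus.txt` stands in for a monolingual Geez-script corpus:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers, processors

# BPE model with the unknown token declared up front.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# NFD -> Lowercase -> StripAccents, as listed above.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.Lowercase(),
    normalizers.StripAccents(),
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(["corpus.txt"], trainer)  # placeholder corpus path

# Post-processing templates for single sentences and sentence pairs.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("tokenizer.json")
```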

## Files

- `vocab.json`: Vocabulary file
- `merges.txt`: BPE merge rules
- `tokenizer.json`: Full tokenizer configuration
- `tokenizer_config.json`: Hugging Face-compatible configuration
- `special_tokens_map.json`: Special token mappings

## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

text = "ሰላም ዓለም!"  # "Hello, world!" in Geez script
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print("Tokens:", tokens)
print("Token IDs:", ids)
```
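
Because of the post-processing template, `encode` should wrap inputs in `[CLS]` and `[SEP]` automatically. A quick way to check:

```python
# Inspect the special-token template applied by encode().
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("ሰላም ዓለም!")))
# Expected to start with '[CLS]' and end with '[SEP]'.
```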

## Intended Use

This tokenizer is best suited for:

- Low-resource NLP pipelines
- Machine Translation
- Question Answering
- Named Entity Recognition
- Morphological analysis

## Limitations

- It is optimized for Geez-script languages and may not generalize to others.
- Some compound verbs and morphologically fused words may still require linguistic preprocessing.
- It is currently monolingual for Amharic and Tigrinya and does not support multilingual code-switching.

## Evaluation

The tokenizer was evaluated manually on:

- Token coverage of Tigrinya/Amharic corpora
- Morphological preservation
- Reduction of BPE segmentation errors

Quantitative metrics will be published in an accompanying paper; a sketch of how such checks can be run follows below.
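
As a stand-in until those numbers are available, this minimal sketch measures two common proxies for the criteria above: subword fertility (average subwords per whitespace-separated word) and the `[UNK]` rate. `eval_corpus.txt` is a placeholder for held-out Tigrinya/Amharic text:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")
unk_id = tokenizer.convert_tokens_to_ids("[UNK]")

n_words = n_subwords = n_unk = 0
with open("eval_corpus.txt", encoding="utf-8") as f:  # placeholder corpus path
    for line in f:
        for word in line.split():
            ids = tokenizer.encode(word, add_special_tokens=False)
            n_words += 1
            n_subwords += len(ids)
            n_unk += sum(1 for i in ids if i == unk_id)

print(f"Fertility (subwords per word): {n_subwords / max(n_words, 1):.2f}")
print(f"[UNK] rate: {n_unk / max(n_subwords, 1):.4%}")
```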

## License

This tokenizer is released under the MIT License.

## Citation

```bibtex
@misc{hailay2025geez,
  title={Geʽez Script Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
  author={Teklehaymanot, Hailay},
  year={2025},
  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
}
```