---
title: Turkish Tokenizer
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
---
# Turkish Tokenizer

A sophisticated Turkish text tokenizer with morphological analysis, built with Gradio for easy visualization and interaction.
## Features

- **Morphological Analysis**: Breaks down Turkish words into roots, suffixes, and BPE tokens
- **Visual Tokenization**: Color-coded token display with interactive highlighting
- **Statistics Dashboard**: Detailed analytics, including compression ratios and token distribution (see the sketch after this list)
- **Real-time Processing**: Instant tokenization with live statistics
- **Example Texts**: Pre-loaded Turkish examples for testing
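
The compression ratio reported in the statistics dashboard is most naturally read as characters of input per produced token. The exact formula used by `app.py` is an assumption here; a minimal sketch:

```python
def compression_ratio(text: str, tokens: list[str]) -> float:
    """Characters per token; higher values indicate more compact tokenization."""
    # Guard against an empty token list so the demo never divides by zero.
    return len(text) / max(len(tokens), 1)


print(compression_ratio("Merhaba Dünya", ["Merhaba", " Dünya"]))  # 6.5
```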
## How to Use

1. Enter Turkish text in the input field
2. Click "🚀 Tokenize" to process the text
3. View the color-coded tokens in the visualization
4. Check the statistics for detailed analysis
5. See the encoded token IDs and decoded text
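
The same encode/decode round trip can presumably be driven from Python without the UI. The import path and method names below are assumptions about the `turkish-tokenizer` package's API, not taken from its documentation:

```python
# Hypothetical programmatic use; the actual turkish-tokenizer API may differ.
from turkish_tokenizer import TurkishTokenizer  # assumed import path and class name

tokenizer = TurkishTokenizer()
text = "İstanbul'da yaşıyorum."

ids = tokenizer.encode(text)   # step 5: the encoded token IDs
print(ids)
print(tokenizer.decode(ids))   # decoded text should round-trip to the input
```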
## Token Types

- **🔴 Roots (ROOT)**: Base word forms
- **🔵 Suffixes (SUFFIX)**: Turkish grammatical suffixes
- **🟡 BPE**: Byte Pair Encoding tokens for subword units
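
As an illustration, a fully dictionary-covered word such as "evlerinde" ("in their houses") splits into one root plus grammatical suffixes, with BPE pieces appearing only for segments the dictionaries miss. The segmentation below is standard Turkish morphology; the labels this particular tokenizer assigns are an assumption:

```python
# "evlerinde" = ev (root) + -ler (plural) + -in (possessive) + -de (locative)
segments = [("ev", "ROOT"), ("ler", "SUFFIX"), ("in", "SUFFIX"), ("de", "SUFFIX")]
```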
## Examples

Try these example texts:

- "Merhaba Dünya! Bu bir gelişmiş Türkçe tokenizer testidir."
- "İstanbul'da yaşıyorum ve Türkçe dilini öğreniyorum."
- "Kitap okumak çok güzeldir ve bilgi verir."
- "Türkiye Cumhuriyeti'nin başkenti Ankara'dır."
- "Yapay zeka ve makine öğrenmesi teknolojileri gelişiyor."
## Technical Details

This tokenizer uses:

- Custom morphological analysis for Turkish
- JSON-based vocabulary files
- Gradio for the web interface
- A hybrid segmentation algorithm that balances morpheme preservation with vocabulary efficiency (see the research paper below)
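
The schema of the JSON vocabulary files is not documented in this README; a plausible shape, purely as an assumption, maps surface forms to IDs per token type:

```python
# Hypothetical vocabulary layout; the real JSON schema is an assumption.
vocab = {
    "roots":    {"ev": 1021, "kitap": 1187},
    "suffixes": {"ler": 2045, "de": 2101},
    "bpe":      {"oku": 3310, "mak": 3412},
}
```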
## Research Paper

This implementation is based on the research paper:

**"Tokens with Meaning: A Hybrid Tokenization Approach for NLP"**

📄 [arXiv:2508.14292](https://arxiv.org/abs/2508.14292)

**Authors:** M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik

**Abstract:** A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
Please cite this paper if you use this tokenizer in your research:

```bibtex
@article{bayram2025tokens,
  title={Tokens with Meaning: A Hybrid Tokenization Approach for NLP},
  author={Bayram, M. Ali and Fincan, Ali Arda and Gümüş, Ahmet Semih and Karakaş, Sercan and Diri, Banu and Yıldırım, Savaş and Çelik, Demircan},
  journal={arXiv preprint arXiv:2508.14292},
  year={2025},
  url={https://arxiv.org/abs/2508.14292}
}
```
## Files

- `app.py`: Main Gradio application
- `requirements.txt`: Python dependencies
## Local Development

To run locally:

```bash
pip install -r requirements.txt
python app.py
```

The app will be available at `http://localhost:7860`.
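
If port 7860 is already in use, Gradio's standard `launch()` parameters can override the host and port; that `app.py` names its Gradio object `demo` is an assumption:

```python
# In app.py, assuming the Gradio Blocks/Interface object is named `demo`:
demo.launch(server_name="0.0.0.0", server_port=7861)
```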
## Dependencies

- `gradio`: Web interface framework
- `turkish-tokenizer`: Core tokenization library
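
A minimal `requirements.txt` consistent with this list (the gradio pin mirrors the `sdk_version` in the Space metadata; pinning `turkish-tokenizer` is left open):

```text
gradio==5.42.0
turkish-tokenizer
```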
## License

This project is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0), as declared in the Space metadata above.