---
title: Turkish Tokenizer
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
---
# Turkish Tokenizer
A sophisticated Turkish text tokenizer with morphological analysis, built with Gradio for easy visualization and interaction.
## Features
- **Morphological Analysis**: Breaks down Turkish words into roots, suffixes, and BPE tokens
- **Visual Tokenization**: Color-coded token display with interactive highlighting
- **Statistics Dashboard**: Detailed analytics including compression ratios and token distribution
- **Real-time Processing**: Instant tokenization with live statistics
- **Example Texts**: Pre-loaded Turkish examples for testing
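The compression-ratio statistic in the dashboard can be understood with a short sketch. This assumes the common definition (input characters per emitted token); the token list below is illustrative, not real tokenizer output.

```python
# Hypothetical sketch of a compression-ratio statistic, assuming it is
# defined as input characters per token. The tokens are illustrative.

def compression_ratio(text: str, tokens: list[str]) -> float:
    """Characters of input covered per emitted token."""
    if not tokens:
        return 0.0
    return len(text) / len(tokens)

text = "kitaplar"          # "books"
tokens = ["kitap", "lar"]  # ROOT + plural SUFFIX (illustrative split)
print(compression_ratio(text, tokens))  # 8 chars / 2 tokens -> 4.0
```

A higher ratio means each token covers more of the input, which is one way a morphology-aware tokenizer can beat plain character-level splitting.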
## How to Use
1. Enter Turkish text in the input field
2. Click "🚀 Tokenize" to process the text
3. View the color-coded tokens in the visualization
4. Check the statistics for detailed analysis
5. See the encoded token IDs and decoded text
## Token Types
- **🔴 Roots (ROOT)**: Base word forms
- **🔵 Suffixes (SUFFIX)**: Turkish grammatical suffixes
- **🟡 BPE**: Byte Pair Encoding tokens for subword units
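The three token types above can be sketched with a toy classifier: try to match a known root, peel off known suffixes, and fall back to subword (BPE) tokens for anything unanalyzable. The dictionaries below are tiny hypothetical stand-ins; the real tokenizer uses full JSON vocabularies and phonological rules.

```python
# Minimal sketch of the ROOT / SUFFIX / BPE-fallback idea.
# ROOTS and SUFFIXES are toy stand-ins for the JSON vocabulary files.

ROOTS = {"kitap", "ev", "göz"}          # toy root dictionary
SUFFIXES = {"lar", "ler", "de", "da"}   # toy suffix dictionary

def classify(word: str) -> list[tuple[str, str]]:
    """Longest-match a root, then peel known suffixes; BPE fallback otherwise."""
    for cut in range(len(word), 0, -1):
        if word[:cut] in ROOTS:
            pieces = [(word[:cut], "ROOT")]
            rest = word[cut:]
            while rest:
                for scut in range(len(rest), 0, -1):
                    if rest[:scut] in SUFFIXES:
                        pieces.append((rest[:scut], "SUFFIX"))
                        rest = rest[scut:]
                        break
                else:
                    pieces.append((rest, "BPE"))  # unanalyzable remainder
                    rest = ""
            return pieces
    return [(word, "BPE")]  # no known root: fall back to subword tokens

print(classify("kitaplarda"))
# [('kitap', 'ROOT'), ('lar', 'SUFFIX'), ('da', 'SUFFIX')]
```

For example, "kitaplarda" ("in the books") splits into the root "kitap", the plural suffix "-lar", and the locative suffix "-da".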
## Examples
Try these example texts:
- "Merhaba Dünya! Bu bir gelişmiş Türkçe tokenizer testidir."
- "İstanbul'da yaşıyorum ve Türkçe dilini öğreniyorum."
- "KitapOkumak çok güzeldir ve bilgi verir."
- "Türkiye Cumhuriyeti'nin başkenti Ankara'dır."
- "Yapay zeka ve makine öğrenmesi teknolojileri gelişiyor."
## Technical Details
This tokenizer uses:
- Custom morphological analysis for Turkish
- JSON-based vocabulary files
- Gradio for the web interface
- A hybrid algorithm that combines rule-based morphological analysis with statistical subword (BPE) segmentation
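The wiring between tokenizer and web UI might look like the sketch below. The `tokenize_for_display` helper is a hypothetical stand-in, not the actual `turkish-tokenizer` API, and the real `app.py` additionally color-codes ROOT/SUFFIX/BPE tokens and renders statistics.

```python
# Hypothetical sketch of how a tokenizer function could be wired into
# Gradio. tokenize_for_display is an illustrative stand-in: it only
# splits on whitespace, while the real app performs morphological analysis.

def tokenize_for_display(text: str) -> str:
    """Stand-in tokenizer: join whitespace-split pieces with separators."""
    return " | ".join(text.split())

if __name__ == "__main__":
    import gradio as gr  # imported lazily so the helper runs without Gradio

    demo = gr.Interface(
        fn=tokenize_for_display,
        inputs=gr.Textbox(label="Turkish text"),
        outputs=gr.Textbox(label="Tokens"),
        title="Turkish Tokenizer",
    )
    demo.launch()  # serves on http://localhost:7860 by default
```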
## Research Paper
This implementation is based on the research paper:
**"Tokens with Meaning: A Hybrid Tokenization Approach for NLP"**
📄 [arXiv:2508.14292](https://arxiv.org/abs/2508.14292)
**Authors:** M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik
**Abstract:** A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
Please cite this paper if you use this tokenizer in your research:
```bibtex
@article{bayram2025tokens,
title={Tokens with Meaning: A Hybrid Tokenization Approach for NLP},
author={Bayram, M. Ali and Fincan, Ali Arda and Gümüş, Ahmet Semih and Karakaş, Sercan and Diri, Banu and Yıldırım, Savaş and Çelik, Demircan},
journal={arXiv preprint arXiv:2508.14292},
year={2025},
url={https://arxiv.org/abs/2508.14292}
}
```
## Files
- `app.py`: Main Gradio application
- `requirements.txt`: Python dependencies
## Local Development
To run locally:
```bash
pip install -r requirements.txt
python app.py
```
The app will be available at `http://localhost:7860`.
## Dependencies
- `gradio`: Web interface framework
- `turkish-tokenizer`: Core tokenization library
## License
This project is available under the CC BY-NC-ND 4.0 license (Creative Commons Attribution-NonCommercial-NoDerivatives 4.0), as declared in the Space metadata above.