---
title: Turkish Tokenizer
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
---

# Turkish Tokenizer

A sophisticated Turkish text tokenizer with morphological analysis, built with Gradio for easy visualization and interaction.

## Features

- **Morphological Analysis**: Breaks down Turkish words into roots, suffixes, and BPE tokens
- **Visual Tokenization**: Color-coded token display with interactive highlighting
- **Statistics Dashboard**: Detailed analytics including compression ratios and token distribution
- **Real-time Processing**: Instant tokenization with live statistics
- **Example Texts**: Pre-loaded Turkish examples for testing
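
The compression ratio reported in the statistics dashboard can be understood as characters per token. This is a sketch of one common definition; the app's exact formula may differ:

```python
def compression_ratio(text: str, tokens: list[str]) -> float:
    """Characters per token: a higher value means the tokenizer packs
    more text into each token. Guards against an empty token list."""
    return len(text) / max(len(tokens), 1)

# "kitaplar" (8 characters) split into 2 tokens -> ratio 4.0
print(compression_ratio("kitaplar", ["kitap", "lar"]))
```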

## How to Use

1. Enter Turkish text in the input field
2. Click "🚀 Tokenize" to process the text
3. View the color-coded tokens in the visualization
4. Check the statistics for detailed analysis
5. See the encoded token IDs and decoded text

## Token Types

- **🔴 Roots (ROOT)**: Base word forms
- **🔵 Suffixes (SUFFIX)**: Turkish grammatical suffixes
- **🟡 BPE**: Byte Pair Encoding tokens for subword units
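
To illustrate how the three token types interact, here is a toy sketch with made-up two- and three-entry dictionaries. It is not the app's real vocabulary or algorithm, only a greedy longest-match demonstration of the ROOT → SUFFIX → BPE fallback order:

```python
# Toy dictionaries for illustration only; the real tokenizer loads
# its root and suffix vocabularies from JSON files.
ROOTS = {"kitap", "ev", "göz"}
SUFFIXES = {"ler", "lar", "de", "da"}

def toy_tokenize(word: str) -> list[tuple[str, str]]:
    """Greedily match the longest known root at the start of the word,
    then repeatedly match the longest known suffix; anything left over
    falls back to single-character BPE-style pieces."""
    tokens = []
    for end in range(len(word), 0, -1):          # longest-prefix root match
        if word[:end] in ROOTS:
            tokens.append((word[:end], "ROOT"))
            word = word[end:]
            break
    while word:                                   # suffix matching loop
        for end in range(len(word), 0, -1):
            if word[:end] in SUFFIXES:
                tokens.append((word[:end], "SUFFIX"))
                word = word[end:]
                break
        else:                                     # no suffix matched
            tokens.append((word[0], "BPE"))       # fall back to subword piece
            word = word[1:]
    return tokens

# "evlerde" -> [('ev', 'ROOT'), ('ler', 'SUFFIX'), ('de', 'SUFFIX')]
print(toy_tokenize("evlerde"))
```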

## Examples

Try these example texts:

- "Merhaba Dünya! Bu bir gelişmiş Türkçe tokenizer testidir."
- "İstanbul'da yaşıyorum ve Türkçe dilini öğreniyorum."
- "KitapOkumak çok güzeldir ve bilgi verir."
- "Türkiye Cumhuriyeti'nin başkenti Ankara'dır."
- "Yapay zeka ve makine öğrenmesi teknolojileri gelişiyor."

## Technical Details

This tokenizer uses:

- Custom morphological analysis for Turkish
- JSON-based vocabulary files
- Gradio for the web interface
- A hybrid segmentation algorithm combining root/suffix matching with subword (BPE) fallback
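
A hypothetical shape for one of the JSON vocabulary files is sketched below. The real files' schema is defined in the repository and may differ; this only shows the idea of mapping roots, suffixes, and BPE pieces to token IDs:

```python
import json

# Assumed structure for illustration; not the repository's actual schema.
vocab_json = """
{
  "roots":    {"kitap": 0,   "ev": 1},
  "suffixes": {"ler": 100,   "de": 101},
  "bpe":      {"k": 200,     "i": 201}
}
"""

vocab = json.loads(vocab_json)
print(vocab["suffixes"]["ler"])  # look up the token ID for the suffix "ler"
```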

## Research Paper

This implementation is based on the research paper:

**"Tokens with Meaning: A Hybrid Tokenization Approach for NLP"**

📄 [arXiv:2508.14292](https://arxiv.org/abs/2508.14292)

**Authors:** M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik

**Abstract:** A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.

Please cite this paper if you use this tokenizer in your research:

```bibtex
@article{bayram2025tokens,
  title={Tokens with Meaning: A Hybrid Tokenization Approach for NLP},
  author={Bayram, M. Ali and Fincan, Ali Arda and Gümüş, Ahmet Semih and Karakaş, Sercan and Diri, Banu and Yıldırım, Savaş and Çelik, Demircan},
  journal={arXiv preprint arXiv:2508.14292},
  year={2025},
  url={https://arxiv.org/abs/2508.14292}
}
```

## Files

- `app.py`: Main Gradio application
- `requirements.txt`: Python dependencies

## Local Development

To run locally:

```bash
pip install -r requirements.txt
python app.py
```

The app will be available at `http://localhost:7860`.

## Dependencies

- `gradio`: Web interface framework
- `turkish-tokenizer`: Core tokenization library

## License

This project is available under the CC BY-NC-ND 4.0 license (Creative Commons Attribution-NonCommercial-NoDerivatives 4.0), as declared in the Space metadata above.