---
title: Turkish Tokenizer
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
---
# Turkish Tokenizer
A Turkish text tokenizer that combines rule-based morphological analysis with statistical subword (BPE) fallback, built with Gradio for easy visualization and interaction.
## Features
- **Morphological Analysis**: Breaks down Turkish words into roots, suffixes, and BPE tokens
- **Visual Tokenization**: Color-coded token display with interactive highlighting
- **Statistics Dashboard**: Detailed analytics including compression ratios and token distribution
- **Real-time Processing**: Instant tokenization with live statistics
- **Example Texts**: Pre-loaded Turkish examples for testing
## How to Use
1. Enter Turkish text in the input field
2. Click "🚀 Tokenize" to process the text
3. View the color-coded tokens in the visualization
4. Check the statistics for detailed analysis
5. See the encoded token IDs and decoded text
## Token Types
- **🔴 Roots (ROOT)**: Base word forms
- **🔵 Suffixes (SUFFIX)**: Turkish grammatical suffixes
- **🟡 BPE**: Byte Pair Encoding tokens for subword units (see the usage sketch below)
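For programmatic access outside the web UI, the same three-way split is exposed by the `turkish-tokenizer` package. The snippet below is a hedged sketch: the `TurkishTokenizer` class name and the `tokenize` method are assumptions about the package's API, so consult its documentation for the exact interface.
```python
# Hedged sketch: the class and method names here are assumptions about
# the turkish-tokenizer package's API, not confirmed signatures.
from turkish_tokenizer import TurkishTokenizer

tokenizer = TurkishTokenizer()

# "evlerimizde" ("in our houses") should yield the root "ev" followed by
# suffix tokens, with BPE used only for pieces the dictionaries miss.
print(tokenizer.tokenize("evlerimizde"))
```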
## Examples
Try these example texts:
- "Merhaba Dünya! Bu bir gelişmiş Türkçe tokenizer testidir."
- "İstanbul'da yaşıyorum ve Türkçe dilini öğreniyorum."
- "KitapOkumak çok güzeldir ve bilgi verir."
- "Türkiye Cumhuriyeti'nin başkenti Ankara'dır."
- "Yapay zeka ve makine öğrenmesi teknolojileri gelişiyor."
## Technical Details
This tokenizer uses:
- Rule-based morphological analysis for Turkish (root and suffix dictionaries)
- JSON-based vocabulary files
- Gradio for the web interface
- A hybrid segmentation algorithm that prefers whole morphemes and falls back to BPE for out-of-vocabulary pieces (sketched below)
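The lookup order can be pictured with a short, self-contained sketch. Everything here is simplified: the tiny `ROOTS` and `SUFFIXES` sets stand in for the JSON vocabulary files, and the BPE fallback is reduced to a single catch-all step, so this illustrates the idea rather than the shipped algorithm.
```python
# Simplified illustration of the hybrid lookup order; the real tokenizer
# loads full JSON vocabularies and a trained BPE model instead.
ROOTS = {"ev", "kitap", "göz"}        # toy root dictionary (illustrative)
SUFFIXES = {"ler", "im", "iz", "de"}  # toy suffix dictionary (illustrative)

def tokenize_word(word: str) -> list[tuple[str, str]]:
    """Greedy longest-root match, then suffix peeling, then BPE fallback."""
    # 1. Find the longest dictionary root at the start of the word.
    for end in range(len(word), 0, -1):
        if word[:end] in ROOTS:
            tokens = [(word[:end], "ROOT")]
            rest = word[end:]
            # 2. Peel known suffixes greedily from what remains.
            while rest:
                for cut in range(len(rest), 0, -1):
                    if rest[:cut] in SUFFIXES:
                        tokens.append((rest[:cut], "SUFFIX"))
                        rest = rest[cut:]
                        break
                else:
                    # 3. Nothing matched: hand the remainder to BPE.
                    tokens.append((rest, "BPE"))
                    rest = ""
            return tokens
    # No root found: the whole word goes to the BPE fallback.
    return [(word, "BPE")]

print(tokenize_word("evlerimizde"))
# [('ev', 'ROOT'), ('ler', 'SUFFIX'), ('im', 'SUFFIX'), ('iz', 'SUFFIX'), ('de', 'SUFFIX')]
# A toy split: a fuller suffix lexicon would produce ev-ler-imiz-de.
```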
## Research Paper
This implementation is based on the research paper:
**"Tokens with Meaning: A Hybrid Tokenization Approach for NLP"**
📄 [arXiv:2508.14292](https://arxiv.org/abs/2508.14292)
**Authors:** M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik
**Abstract:** A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
Please cite this paper if you use this tokenizer in your research:
```bibtex
@article{bayram2025tokens,
  title={Tokens with Meaning: A Hybrid Tokenization Approach for NLP},
  author={Bayram, M. Ali and Fincan, Ali Arda and Gümüş, Ahmet Semih and Karakaş, Sercan and Diri, Banu and Yıldırım, Savaş and Çelik, Demircan},
  journal={arXiv preprint arXiv:2508.14292},
  year={2025},
  url={https://arxiv.org/abs/2508.14292}
}
```
## Files
- `app.py`: Main Gradio application (a skeleton is sketched below)
- `requirements.txt`: Python dependencies
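For orientation, a stripped-down `app.py` might look like the sketch below. This is an assumed skeleton, not the shipped file, which adds the color-coded HTML view, statistics, and examples; the `TurkishTokenizer` import and `encode` call are likewise assumptions about the package API.
```python
# Assumed skeleton of app.py, not the shipped file; the real app renders
# color-coded HTML tokens and a statistics dashboard.
import gradio as gr
from turkish_tokenizer import TurkishTokenizer  # assumed import path

tokenizer = TurkishTokenizer()

def tokenize(text: str) -> str:
    # Placeholder output; the encode() method name is an assumption.
    return str(tokenizer.encode(text))

demo = gr.Interface(fn=tokenize, inputs="text", outputs="text",
                    title="Turkish Tokenizer")

if __name__ == "__main__":
    demo.launch()  # serves on http://localhost:7860 by default
```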
## Local Development
To run locally:
```bash
pip install -r requirements.txt
python app.py
```
The app will be available at `http://localhost:7860`
## Dependencies
- `gradio`: Web interface framework
- `turkish-tokenizer`: Core tokenization library (see the illustrative pins below)
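An illustrative `requirements.txt` along these lines (the `gradio` pin mirrors the `sdk_version` in the front matter; the Space pins a specific `turkish-tokenizer` release, which is not reproduced here):
```text
gradio==5.42.0       # mirrors sdk_version in the front matter
turkish-tokenizer    # the Space pins an exact release of this package
```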
## License
This project is available under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 (CC BY-NC-ND 4.0) license, as declared in the metadata above.