---
license: mit
tags:
- tokenizer
- sentencepiece
- monolingual
- guj
- vocab-128000
---
# Monolingual Tokenizer - Gujarati (Vocab 128000)

This is a monolingual SentencePiece tokenizer trained on Gujarati text with a vocabulary size of 128,000.
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monolingual-tokenizer-iso-guj-vocab-128000")
```
## Files

- `guj.model`: SentencePiece model file
- `guj.vocab`: Vocabulary file
- `config.json`: Tokenizer configuration
## Training Details

- Language: Gujarati (guj)
- Vocabulary Size: 128,000
- Model Type: SentencePiece Unigram
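The Unigram model type listed above segments text by choosing the piece sequence with the highest total log-probability under a learned unigram language model. A minimal sketch of that Viterbi search, using a small hypothetical vocabulary (the pieces and scores below are illustrative stand-ins for the 128,000 entries in `guj.model`):

```python
import math

# Hypothetical toy vocabulary with unigram log-probabilities.
# The real guj.model stores ~128,000 pieces learned from Gujarati text;
# "▁" marks a word boundary, as in SentencePiece.
vocab = {
    "▁hel": -3.0,
    "▁hello": -2.5,
    "lo": -2.8,
    "l": -4.0,
    "o": -4.0,
    "▁wor": -3.0,
    "ld": -2.9,
    "▁world": -2.6,
}

def viterbi_segment(text, vocab):
    """Return the most probable segmentation of `text` under a unigram LM."""
    n = len(text)
    # best[i] = (score of best segmentation of text[:i], backpointer)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab:
                score = best[start][0] + vocab[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk the backpointers to recover the winning pieces.
    pieces, i = [], n
    while i > 0:
        start = best[i][1]
        pieces.append(text[start:i])
        i = start
    return pieces[::-1]

print(viterbi_segment("▁hello▁world", vocab))  # → ['▁hello', '▁world']
```

The whole-word pieces win here because their combined score (-2.5 + -2.6) beats any split into smaller pieces; the actual tokenizer applies the same search over its full Gujarati vocabulary.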