Sai5480 committed on
Commit 62b9027 · verified · 1 Parent(s): 2942126

Add README for guj tokenizer

Files changed (1)
  1. README.md +33 -0
README.md ADDED
@@ -0,0 +1,33 @@
+ ---
+ license: mit
+ tags:
+ - tokenizer
+ - sentencepiece
+ - monolingual
+ - guj
+ - vocab-128000
+ ---
+
+ # Monolingual Tokenizer - Gujarati (Vocab 128000)
+
+ This is a monolingual tokenizer trained on Gujarati text with vocabulary size 128000.
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("monolingual-tokenizer-iso-guj-vocab-128000")
+ ```
+
+ ## Files
+
+ - `guj.model`: SentencePiece model file
+ - `guj.vocab`: Vocabulary file
+ - `config.json`: Tokenizer configuration
+
+ ## Training Details
+
+ - Language: Gujarati (guj)
+ - Vocabulary Size: 128000
+ - Model Type: SentencePiece Unigram
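
The README lists `guj.model` and `guj.vocab` but only shows loading through `transformers`. The following is a minimal sketch, not part of the commit, of how those two files might be used directly. It assumes both files have been downloaded to the working directory, that `guj.vocab` follows the usual SentencePiece layout of one `piece<TAB>score` pair per line, and that the `sentencepiece` package is installed for the optional `load_processor` helper. The names `read_vocab`, `load_processor`, and the sample entries with their scores are hypothetical and chosen only for illustration.

```python
import io
from typing import Dict, Iterable


def read_vocab(lines: Iterable[str]) -> Dict[str, float]:
    """Parse SentencePiece .vocab lines of the form '<piece>\\t<score>'."""
    vocab: Dict[str, float] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the score is always the final field.
        piece, _, score = line.rpartition("\t")
        vocab[piece] = float(score)
    return vocab


def load_processor(model_path: str = "guj.model"):
    """Load the SentencePiece model directly (assumes `pip install sentencepiece`)."""
    # Imported lazily so read_vocab stays usable without the package.
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=model_path)


# Tiny in-memory stand-in for guj.vocab; the pieces and scores are made up.
sample = io.StringIO("<unk>\t0\n▁ગુજરાતી\t-8.5\n▁છે\t-6.25\n")
sample_vocab = read_vocab(sample)
```

With the real files in place, `read_vocab(open("guj.vocab", encoding="utf-8"))` should return one entry per vocabulary piece, and `load_processor().encode("ગુજરાતી ભાષા", out_type=str)` should return the subword pieces for that string. If the published vocabulary file uses a different layout, `read_vocab` would need adjusting.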