aaparajit02 committed
Commit 4470265 · verified · 1 Parent(s): eba71b8

Update README.md

Files changed (1)
  1. README.md +16 -1
README.md CHANGED
@@ -11,4 +11,19 @@ datasets:
  language:
  - ml
  pipeline_tag: text-classification
- ---
+ ---
+
+ ### About
+
+ - This tokenizer was trained on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset, from which 1.2 million data points were sampled at random.
+ - Training was done with Google's [SentencePiece](https://github.com/google/sentencepiece) library.
+ - The trained tokens were then added to the `LlamaTokenizer`, expanding the vocabulary from the original 32,000 tokens to 49,120.
+ - The merge follows the approach of the [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_tokenizer/merge_tokenizers.py) merge script; a rough end-to-end sketch follows below.
+
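+ The snippet below is a minimal, hypothetical sketch of that pipeline rather than the exact script used for this tokenizer: the intermediate file names, the SentencePiece `vocab_size`, and the base checkpoint `meta-llama/Llama-2-7b-hf` are assumptions, and `add_tokens` stands in for the protobuf-level merge performed by the Chinese-LLaMA-Alpaca script.
+
+ ```python
+ import sentencepiece as spm
+ from datasets import load_dataset
+ from transformers import LlamaTokenizer
+
+ # 1. Stream Malayalam text from CulturaX and write it to a plain-text file.
+ #    (The model card reports 1.2M randomly sampled data points; for brevity this
+ #    simply takes the first 1.2M documents. "text" is the CulturaX document field.)
+ ds = load_dataset("uonlp/CulturaX", "ml", split="train", streaming=True)
+ with open("culturax_ml_sample.txt", "w", encoding="utf-8") as f:
+     for i, row in enumerate(ds):
+         if i >= 1_200_000:
+             break
+         f.write(row["text"].replace("\n", " ") + "\n")
+
+ # 2. Train a SentencePiece model on the sample (vocab_size is an assumed value).
+ spm.SentencePieceTrainer.train(
+     input="culturax_ml_sample.txt",
+     model_prefix="malayalam_sp",
+     vocab_size=20000,
+     model_type="bpe",
+ )
+
+ # 3. Add the new pieces to the base Llama tokenizer. The reference script edits the
+ #    SentencePiece protobuf directly; add_tokens is a simpler approximation.
+ sp = spm.SentencePieceProcessor(model_file="malayalam_sp.model")
+ base = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
+ new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
+ base.add_tokens([p for p in new_pieces if p not in base.get_vocab()])
+ base.save_pretrained("malayalam-llama-2-tokenizer")
+ print(len(base))  # vocabulary size after merging; the released tokenizer reports 49,120
+ ```
+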
+ ### Usage
+ ```python
+ from transformers import LlamaTokenizer
+
+ # Load the merged Malayalam Llama-2 tokenizer from the Hub.
+ tokenizer = LlamaTokenizer.from_pretrained("learninbit/malayalam-llama-2-tokenizer-v0.1")
+
+ text = "ഹനഫസ ഹഫഞ്ചഥ ചകഡു ടെണല ഡൃൊമത്തീഴ ടഞ്ഞഭഞ റദ്ധഷ ഌിപത്മഫഥ ടജ്ജഡ്ഡപ്പെവ പഴുണൊ."
+ tokens = tokenizer.tokenize(text)
+ ```
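+
+ A quick, optional sanity check on the merged vocabulary (`len(tokenizer)` counts the base vocabulary plus the added tokens):
+
+ ```python
+ print(len(tokenizer))  # should report the expanded vocabulary of 49,120 tokens
+ ```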