learninbit/malayalam-llama-2-tokenizer-v0.1


aaparajit02 committed on Jan 26, 2024

Commit 4470265 · verified · 1 Parent(s): eba71b8

Update README.md

Files changed (1)

README.md +16 -1

@@ -11,4 +11,19 @@ datasets:
 language:
 - ml
 pipeline_tag: text-classification
----

+---
+### About
+- This tokenizer was trained on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset, from which 1.2 million data points were sampled at random.
+- Training used Google's [SentencePiece](https://github.com/google/sentencepiece) library.
+- The newly trained tokens were then added to the `LlamaTokenizer`, growing the vocabulary from the original 32,000 tokens to 49,120.
+- The merge follows the procedure in the [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_tokenizer/merge_tokenizers.py) merge script.
+### Usage
+```python
+from transformers import LlamaTokenizer
+tokenizer = LlamaTokenizer.from_pretrained("learninbit/malayalam-llama-2-tokenizer-v0.1")
+text = "ഹനഫസ ഹഫഞ്ചഥ ചകഡു ടെണല ഡൃൊമത്തീഴ ടഞ്ഞഭഞ റദ്ധഷ ഌിപത്മഫഥ ടജ്ജഡ്ഡപ്പെവ പഴുണൊ."
+# Split the text into subword token strings
+tokens = tokenizer.tokenize(text)
+```
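
Note on the merge step described under About: the idea of extending a base vocabulary with newly trained pieces can be sketched as below. This is a minimal illustration, assuming the merge simply appends every new piece that the base vocabulary does not already contain. The function name `merge_pieces` and the toy vocabularies are hypothetical; the actual script operates on SentencePiece model files rather than plain Python lists.

```python
def merge_pieces(base_pieces, new_pieces):
    """Append pieces from new_pieces that are missing from base_pieces.

    The base order is preserved, so existing token ids stay unchanged.
    """
    seen = set(base_pieces)
    merged = list(base_pieces)
    for piece in new_pieces:
        if piece not in seen:
            seen.add(piece)
            merged.append(piece)
    return merged


# Toy vocabularies (hypothetical; real ones hold tens of thousands of pieces).
base = ["<unk>", "<s>", "</s>", "▁the", "▁a"]
malayalam = ["<unk>", "<s>", "</s>", "▁മല", "യാളം", "▁a"]

merged = merge_pieces(base, malayalam)
added = len(merged) - len(base)
print(merged)
print("pieces added:", added)

# Figures stated in the README: 32,000 base tokens, 49,120 after merging.
print("tokens added in the README's merge:", 49_120 - 32_000)
```

Because the base pieces keep their positions, ids below 32,000 would still map to the original Llama 2 tokens under this assumption, and the README's figures imply 17,120 added entries.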