BoKenlm-sp - Tibetan KenLM Language Model

A KenLM n-gram language model trained on Tibetan text tokenized with the SentencePiece tokenizer.

Model Details

Parameter	Value
Model Type	Modified Kneser-Ney 5-gram
Tokenizer	openpecha/BoSentencePiece (Unigram, 20k vocab)
Training Corpus	`bo_corpus.txt`
Pruning	0 0 1
Tokens	42,010,347
Vocabulary Size	20,003

N-gram Statistics

Order	Count	D1	D2	D3+
1	20,003	0.4921	0.3393	1.0317
2	6,945,893	0.6676	1.1495	1.5504
3	4,960,553	0.8443	1.2638	1.4835
4	4,211,842	0.9154	1.3888	1.5332
5	3,276,583	0.8525	1.5142	1.6453
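
The D1, D2 and D3+ columns are the per-order discounts applied to n-grams seen once, twice, and three or more times. As a rough illustration, the sketch below computes such discounts with the closed-form estimate from Chen and Goodman's modified Kneser-Ney formulation. It assumes that is how the values above were derived, and the count-of-counts passed in are made-up toy numbers, not figures from this model.

```python
from typing import Tuple


def modified_kn_discounts(n1: int, n2: int, n3: int, n4: int) -> Tuple[float, float, float]:
    """Closed-form modified Kneser-Ney discounts (Chen & Goodman).

    n1..n4 are count-of-counts: the number of distinct n-grams of a given
    order whose (adjusted) count is exactly 1, 2, 3 and 4.
    Returns (D1, D2, D3+).
    """
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3_plus = 3 - 4 * y * n4 / n3
    return d1, d2, d3_plus


# Toy count-of-counts, chosen only for illustration (hypothetical values).
d1, d2, d3_plus = modified_kn_discounts(n1=100, n2=50, n3=30, n4=20)
print(f"D1={d1:.4f} D2={d2:.4f} D3+={d3_plus:.4f}")
```

Each discount is subtracted from the matching raw count before probabilities are normalized, and the freed mass goes to the lower-order backoff distribution.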

Memory Estimates

Type	MB	Details
probing	425	assuming -p 1.5
probing	517	assuming -r models -p 1.5
trie	211	without quantization
trie	112	assuming -q 8 -b 8 quantization
trie	180	assuming -a 22 array pointer compression
trie	81	assuming -a 22 -q 8 -b 8 array pointer compression and quantization
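
These estimates correspond to options of KenLM's `build_binary` tool, which converts an ARPA file into a faster-loading binary. The sketch below assembles such a command line for the smallest configuration listed (trie with `-a 22 -q 8 -b 8`). It assumes `build_binary` is installed and on `PATH`; the output file name `BoKenlm-sp.trie.bin` and the helper names are hypothetical and not part of this repository.

```python
import shutil
import subprocess
from typing import List, Optional


def build_binary_command(
    arpa_path: str,
    out_path: str,
    structure: str = "trie",             # "trie" or "probing"
    pointer_bits: Optional[int] = None,  # -a: array pointer compression (trie)
    quant_bits: Optional[int] = None,    # -q: probability quantization (trie)
    backoff_bits: Optional[int] = None,  # -b: backoff quantization (trie)
    probing_multiplier: Optional[float] = None,  # -p: hash table slack (probing)
) -> List[str]:
    """Assemble the argument list for KenLM's build_binary."""
    cmd = ["build_binary"]
    if pointer_bits is not None:
        cmd += ["-a", str(pointer_bits)]
    if quant_bits is not None:
        cmd += ["-q", str(quant_bits)]
    if backoff_bits is not None:
        cmd += ["-b", str(backoff_bits)]
    if probing_multiplier is not None:
        cmd += ["-p", str(probing_multiplier)]
    cmd += [structure, arpa_path, out_path]
    return cmd


def run_if_available(cmd: List[str]) -> bool:
    """Run the command only when the executable can be found."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


# Smallest configuration from the table; the output name is hypothetical.
cmd = build_binary_command(
    "BoKenlm-sp.arpa", "BoKenlm-sp.trie.bin",
    pointer_bits=22, quant_bits=8, backoff_bits=8,
)
print(" ".join(cmd))
```

The resulting binary can be passed to `kenlm.Model` in place of the ARPA file. Quantization trades a small amount of scoring precision for the lower memory footprint.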

Training Resources

Metric	Value
Peak Virtual Memory	12,333 MB
Peak RSS	3,578 MB
Wall Time	42.9s
User Time	48.5s
System Time	19.7s

Usage

import kenlm

model = kenlm.Model("BoKenlm-sp.arpa")

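# Score a sentence already split into space-separated SentencePiece pieces.
# The result is a log10 probability that includes the <s> and </s> markers.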
score = model.score("▁བོད་སྐད་ ▁ཀྱི་ ▁ཚིག་གྲུབ་ ▁འདི་ ▁ཡིན།")
print(score)
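
The model expects input that has already been split into SentencePiece pieces, so raw Tibetan text has to go through the tokenizer first. The following is a minimal sketch of that pipeline, which also converts the log10 score into a per-token perplexity. The tokenizer file name `BoSentencePiece.model` and the helper names are assumptions made for illustration; check the openpecha/BoSentencePiece repository for the actual file.

```python
from typing import Callable, List, Tuple


def perplexity_from_log10(log10_prob: float, n_tokens: int) -> float:
    """Per-token perplexity from a log10 sentence probability.

    The +1 accounts for the end-of-sentence marker, matching the
    convention of kenlm's own Model.perplexity().
    """
    return 10.0 ** (-log10_prob / (n_tokens + 1))


def score_text(
    text: str,
    encode: Callable[[str], List[str]],
    score: Callable[[str], float],
) -> Tuple[float, float]:
    """Tokenize raw text, score it, and return (log10 prob, perplexity)."""
    pieces = encode(text)
    line = " ".join(pieces)  # KenLM expects space-separated tokens
    log10_prob = score(line)
    return log10_prob, perplexity_from_log10(log10_prob, len(pieces))


def load_and_score(
    text: str,
    arpa_path: str = "BoKenlm-sp.arpa",
    spm_path: str = "BoSentencePiece.model",  # hypothetical file name
) -> Tuple[float, float]:
    """Wire the helpers to the real libraries (requires kenlm, sentencepiece)."""
    import kenlm
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=spm_path)
    lm = kenlm.Model(arpa_path)
    return score_text(
        text,
        encode=lambda t: sp.encode(t, out_type=str),
        score=lambda line: lm.score(line, bos=True, eos=True),
    )
```

Calling `load_and_score("བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།")` returns the log10 probability and the perplexity for that sentence once both files are available locally. Lower perplexity means the model finds the text more predictable.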

Files

BoKenlm-sp.arpa — ARPA format language model
README.md — This model card

License

Apache 2.0
