Kaz-RoBERTa (base-sized model)

Model description

Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the Masked language modeling (MLM) objective.

Usage

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> pipe = pipeline('fill-mask', model='kz-transformers/kaz-roberta-conversational')
>>> pipe("Мәтел тура, ауыспалы, астарлы <mask> қолданылады")
#Out:
# [{'score': 0.8131822347640991,
#   'token': 18749,
#   'token_str': ' мағынада',
#   'sequence': 'Мәтел тура, ауыспалы, астарлы мағынада қолданылады'},
#  ...
#  ...]
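
The model can also be loaded directly to extract contextual embeddings. Below is a minimal PyTorch sketch; the feature-extraction pattern is the generic transformers API rather than anything specific to this card:

>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
>>> model = AutoModel.from_pretrained('kz-transformers/kaz-roberta-conversational')
>>> inputs = tokenizer("Мәтел тура, ауыспалы, астарлы мағынада қолданылады", return_tensors='pt')
>>> outputs = model(**inputs)
>>> outputs.last_hidden_state.shape  # (batch_size, sequence_length, hidden_size)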

Training data

The Kaz-RoBERTa model was pretrained on the combination of two datasets:

  • MDBKD Multi-Domain Bilingual Kazakh Dataset: a Kazakh-language dataset of roughly 24.9 million unique texts (24,883,808) from multiple domains.
  • Conversational data: preprocessed dialogues between the customer support team and clients of Beeline KZ (VEON Group).

Together these datasets comprise 25 GB of text.

Training procedure

Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 52,000. The model's inputs are sequences of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked with <s> and the end of one with </s>.
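
As an illustration of these special tokens, a minimal sketch using the released tokenizer (the commented outputs reflect the figures stated above and are not verified against the actual checkpoint):

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
>>> ids = tokenizer("Мәтел тура, ауыспалы, астарлы мағынада қолданылады")['input_ids']
>>> tokenizer.convert_ids_to_tokens(ids[0]), tokenizer.convert_ids_to_tokens(ids[-1])
# ('<s>', '</s>')  every encoded document is wrapped in these markers
>>> tokenizer.vocab_size
# 52000  the BPE vocabulary size reported above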

Pretraining

The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512. The masking probability for the MLM objective was 15%, and the architecture uses num_attention_heads=12 and num_hidden_layers=6.
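
For reference, a hedged sketch of an equivalent pretraining setup. Only the 52,000-token vocabulary, the 512-token sequence length, num_attention_heads, num_hidden_layers, and the 15% MLM probability come from this card; hidden_size=768 is an assumption (the usual value for base-sized RoBERTa models), not a stated fact:

from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          RobertaConfig, RobertaForMaskedLM)

# Values taken from this card: 52,000 BPE vocabulary, 512-token sequences,
# 12 attention heads, 6 hidden layers, 15% masking probability.
# hidden_size=768 is assumed, not stated in the card.
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,  # 512 tokens plus the two offset positions RoBERTa reserves
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,
)
model = RobertaForMaskedLM(config)

# Dynamic masking with the released tokenizer at the stated 15% probability
tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)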

Citation

If you use Kaz-RoBERTa Conversational, please cite:

Cite as: Beksultan Sagyndyk, Sanzhar Murzakhmetov, Kirill Yakunin. Kaz-RoBERTa Conversational Technical Report. TechRxiv. October 02, 2025.
DOI: 10.36227/techrxiv.175942902.25827042/v1

BibTeX

@misc{Sagyndyk2025KazRobertaConversational,
  title  = {Kaz-RoBERTa Conversational Technical Report},
  author = {Beksultan Sagyndyk and Sanzhar Murzakhmetov and Kirill Yakunin},
  year   = {2025},
  publisher = {TechRxiv},
  doi    = {10.36227/techrxiv.175942902.25827042/v1},
  url    = {https://doi.org/10.36227/techrxiv.175942902.25827042/v1}
}