---
license: cc-by-nc-4.0
language:
  - fa
tags:
  - masked-language-modeling
  - feature-extraction
  - large-scale-dataset
  - Persian
  - dataset_size:72.9B
  - no-next-sentence-prediction
pipeline_tag: fill-mask
extra_gated_description: >-
  You agree to not use the model to conduct experiments that cause harm to human
  subjects.
extra_gated_fields:
  Full Name: text
  Organization (University): text
  Email address: text
  Country: country
  Could you briefly explain the purpose of using the dataset?: text
  I agree to use this dataset for non-commercial use ONLY: checkbox
---

# Persian Masked Language Model (MLM)

This model is a Masked Language Model (MLM) trained on a 72.9-billion-token corpus of Persian text, making it one of the largest and most comprehensive models pre-trained exclusively for the Persian language. The model is designed to enhance language understanding tasks and provide high-quality contextual embeddings for various NLP applications in Persian.

- Our Paper: [Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization](https://arxiv.org/abs/2501.04858)

## Model Details

### Model Description

- Model Type: Masked Language Model (MLM)
- Base Model: XLM-RoBERTa Large
- Objective: Predicting randomly masked tokens within sequences
- Training Corpus Size: 72.9 billion tokens
- Maximum Sequence Length: 512 tokens
- Special Feature: No Next Sentence Prediction (NSP) task
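
As a quick illustration of the pre-training objective (random token masking, with no NSP component), the snippet below shows how such masking is typically set up with the 🤗 Transformers data collator. The model ID is a placeholder, and the 15% masking ratio is only the library default; the exact ratio used for this model is not reported here.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder model ID; replace with this repository's ID on the Hub.
tokenizer = AutoTokenizer.from_pretrained("your_model_id")

# Randomly mask tokens on the fly; there is no next-sentence-prediction objective.
# The 15% ratio is the library default, used here only for illustration.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer("زبان فارسی یکی از زبان‌های مهم جهان است.", return_tensors="pt")
batch = collator([{key: value[0] for key, value in encoded.items()}])
print(batch["input_ids"])  # some token IDs replaced by the mask token
print(batch["labels"])     # original IDs at masked positions, -100 elsewhere
```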

## Training Details

### Training Configuration

- Hardware: 8 NVIDIA A800 GPUs
- Duration: One week
- Optimization Framework: DeepSpeed (Stage 0)
- Training Parameters:
  - Learning Rate: 5e-5
  - Maximum Sequence Length: 512 tokens
  - Precision: FP16 (Mixed Precision)
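
For reference, a DeepSpeed configuration matching the reported setup (ZeRO Stage 0 with FP16 mixed precision) looks roughly like the sketch below. It is illustrative only, not the authors' released configuration, and would be passed to the `deepspeed` argument of 🤗 `TrainingArguments` or saved as a JSON file.

```python
# Illustrative DeepSpeed configuration mirroring the reported setup
# (ZeRO Stage 0, FP16 mixed precision); not the authors' exact config.
# "auto" lets the 🤗 Trainer fill in values from its own arguments.
ds_config = {
    "zero_optimization": {"stage": 0},        # Stage 0: plain data parallelism, no sharding
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```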

### Corpus

The model was pre-trained on a large-scale corpus of Persian text collected from diverse sources, ensuring broad language coverage and contextual diversity:

- Web-crawled data
- Academic articles and books
- Persian Wikipedia
- Religious texts
- Social media platforms

The data underwent extensive preprocessing, including deduplication and noise removal, to ensure high-quality training data.
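
The cleaning pipeline itself is not published with this card; the sketch below only illustrates the kind of hash-based deduplication and simple noise filtering described above, with hypothetical helper names and thresholds.

```python
import hashlib

def clean_corpus(documents, min_chars=200):
    """Illustrative only: exact-duplicate removal plus a crude length filter.
    The actual preprocessing used for this model is not published here."""
    seen_hashes = set()
    for doc in documents:
        text = " ".join(doc.split())          # normalize whitespace
        if len(text) < min_chars:             # drop very short / noisy fragments
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:             # skip exact duplicates
            continue
        seen_hashes.add(digest)
        yield text
```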

## Usage

The model can be used for various downstream NLP tasks in Persian, including:

- Text classification
- Named entity recognition
- Question answering
- Semantic search
- Contextual embedding generation (see the embedding sketch below)
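
For embedding-style tasks such as semantic search, the encoder can be used without the MLM head. The minimal mean-pooling sketch below (the model ID is a placeholder) shows one common way to obtain sentence vectors:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder model ID; replace with this repository's ID on the Hub.
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModel.from_pretrained("your_model_id")  # encoder only, no MLM head

sentences = ["کتابخانه ملی ایران در تهران است.", "این کتابخانه در پایتخت قرار دارد."]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(float(similarity))
```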

### Example Usage

This model can be loaded and used with the 🤗 Transformers library:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load tokenizer and model ("your_model_id" is a placeholder for this repository's ID)
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModelForMaskedLM.from_pretrained("your_model_id")

# Example text; use the tokenizer's own mask token (<mask> for XLM-RoBERTa), not [MASK]
text = f"این یک {tokenizer.mask_token} جدید است."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Decode the highest-scoring candidate at the mask position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```
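
Alternatively, the fill-mask pipeline returns ranked candidate tokens directly (again with a placeholder model ID):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="your_model_id")  # placeholder model ID
masked_text = f"این یک {fill_mask.tokenizer.mask_token} جدید است."
for prediction in fill_mask(masked_text):
    print(prediction["token_str"], prediction["score"])
```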

## Training Procedure

### Training Hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 30
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 480
- total_eval_batch_size: 64
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP
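
Expressed as 🤗 `TrainingArguments`, the values above map roughly onto the following; this is an illustrative sketch (the output path is a placeholder), not the authors' exact training script.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; not the original script.
training_args = TrainingArguments(
    output_dir="mlm-checkpoints",      # placeholder path
    per_device_train_batch_size=30,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,     # 30 x 8 GPUs x 2 = 480 effective train batch size
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
    optim="adamw_torch",
    seed=42,
    fp16=True,                         # native AMP mixed precision
)
```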

### Framework Versions

- Transformers 4.47.0.dev0
- PyTorch 2.4.1+cu121
- Datasets 3.0.2
- Tokenizers 0.20.1

## Citation

If you find this model helpful, please cite the following paper.

BibTeX:

```bibtex
@misc{hosseinbeigi2025advancingretrievalaugmentedgenerationpersian,
      title={Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization},
      author={Sara Bourbour Hosseinbeigi and Sina Asghari and Mohammad Ali Seif Kashani and Mohammad Hossein Shalchian and Mohammad Amin Abbasi},
      year={2025},
      eprint={2501.04858},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.04858},
}
```