---
license: cc-by-nc-4.0
language:
  - fa
tags:
  - masked-language-modeling
  - feature-extraction
  - large-scale-dataset
  - Persian
  - dataset_size:72.9B
  - no-next-sentence-prediction
pipeline_tag: fill-mask
extra_gated_description: >-
  You agree to not use the model to conduct experiments that cause harm to human
  subjects.
extra_gated_fields:
  Full Name: text
  Organization (University): text
  Email address: text
  Country: country
  Could you briefly explain the purpose of using the dataset?: text
  I agree to use this dataset for non-commercial use ONLY: checkbox
---

# Persian Masked Language Model (MLM)

This model is a Masked Language Model (MLM) trained on a 72.9-billion-token corpus of Persian text, making it one of the largest and most comprehensive models pre-trained exclusively for the Persian language. The model is designed to enhance language understanding tasks and provide high-quality contextual embeddings for various NLP applications in Persian.

- Our Paper: [Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization](https://arxiv.org/abs/2501.04858)

## Model Details

### Model Description

- Model Type: Masked Language Model (MLM)
- Base Model: XLM-RoBERTa Large
- Objective: Predicting randomly masked tokens within sequences
- Training Corpus Size: 72.9 billion tokens
- Maximum Sequence Length: 512 tokens
- Special Feature: No Next Sentence Prediction (NSP) task
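
As a quick illustration of the pre-training objective (random token masking, with no NSP component), the snippet below shows how such masking is typically set up with the 🤗 Transformers data collator. The model ID is a placeholder, and the 15% masking ratio is only the library default; the exact ratio used for this model is not reported here.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder model ID; replace with this repository's ID on the Hub.
tokenizer = AutoTokenizer.from_pretrained("your_model_id")

# Randomly mask tokens on the fly; there is no next-sentence-prediction objective.
# The 15% ratio is the library default, used here only for illustration.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer("زبان فارسی یکی از زبان‌های مهم جهان است.", return_tensors="pt")
batch = collator([{key: value[0] for key, value in encoded.items()}])
print(batch["input_ids"])  # some token IDs replaced by the mask token
print(batch["labels"])     # original IDs at masked positions, -100 elsewhere
```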

## Training Details

### Training Configuration

- Hardware: 8 NVIDIA A800 GPUs
- Duration: One week
- Optimization Framework: DeepSpeed (Stage 0)
- Training Parameters:
  - Learning Rate: 5e-5
  - Maximum Sequence Length: 512 tokens
  - Precision: FP16 (Mixed Precision)
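
For reference, a DeepSpeed configuration matching the reported setup (ZeRO Stage 0 with FP16 mixed precision) looks roughly like the sketch below. It is illustrative only, not the authors' released configuration, and would be passed to the `deepspeed` argument of 🤗 `TrainingArguments` or saved as a JSON file.

```python
# Illustrative DeepSpeed configuration mirroring the reported setup
# (ZeRO Stage 0, FP16 mixed precision); not the authors' exact config.
# "auto" lets the 🤗 Trainer fill in values from its own arguments.
ds_config = {
    "zero_optimization": {"stage": 0},        # Stage 0: plain data parallelism, no sharding
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```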

### Corpus

The model was pre-trained on a large-scale corpus of Persian text collected from diverse sources, ensuring broad language coverage and contextual diversity:

- Web-crawled data
- Academic articles and books
- Persian Wikipedia
- Religious texts
- Social media platforms

The data underwent extensive preprocessing, including deduplication and noise removal, to ensure high-quality training data.
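
The cleaning pipeline itself is not published with this card; the sketch below only illustrates the kind of hash-based deduplication and simple noise filtering described above, with hypothetical helper names and thresholds.

```python
import hashlib

def clean_corpus(documents, min_chars=200):
    """Illustrative only: exact-duplicate removal plus a crude length filter.
    The actual preprocessing used for this model is not published here."""
    seen_hashes = set()
    for doc in documents:
        text = " ".join(doc.split())          # normalize whitespace
        if len(text) < min_chars:             # drop very short / noisy fragments
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:             # skip exact duplicates
            continue
        seen_hashes.add(digest)
        yield text
```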

## Usage

The model can be used for various downstream NLP tasks in Persian, including:

- Text classification
- Named entity recognition
- Question answering
- Semantic search
- Contextual embedding generation (see the embedding sketch below)
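
For embedding-style tasks such as semantic search, the encoder can be used without the MLM head. The minimal mean-pooling sketch below (the model ID is a placeholder) shows one common way to obtain sentence vectors:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder model ID; replace with this repository's ID on the Hub.
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModel.from_pretrained("your_model_id")  # encoder only, no MLM head

sentences = ["کتابخانه ملی ایران در تهران است.", "این کتابخانه در پایتخت قرار دارد."]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(float(similarity))
```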

### Example Usage

This model can be loaded and used with the 🤗 Transformers library:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load tokenizer and model ("your_model_id" is a placeholder for this repository's ID)
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModelForMaskedLM.from_pretrained("your_model_id")

# Example text; use the tokenizer's own mask token (<mask> for XLM-RoBERTa), not [MASK]
text = f"این یک {tokenizer.mask_token} جدید است."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Decode the highest-scoring candidate at the mask position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```
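
Alternatively, the fill-mask pipeline returns ranked candidate tokens directly (again with a placeholder model ID):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="your_model_id")  # placeholder model ID
masked_text = f"این یک {fill_mask.tokenizer.mask_token} جدید است."
for prediction in fill_mask(masked_text):
    print(prediction["token_str"], prediction["score"])
```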

## Training Procedure

### Training Hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 30
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 480
- total_eval_batch_size: 64
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP
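
Expressed as 🤗 `TrainingArguments`, the values above map roughly onto the following; this is an illustrative sketch (the output path is a placeholder), not the authors' exact training script.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; not the original script.
training_args = TrainingArguments(
    output_dir="mlm-checkpoints",      # placeholder path
    per_device_train_batch_size=30,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,     # 30 x 8 GPUs x 2 = 480 effective train batch size
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
    optim="adamw_torch",
    seed=42,
    fp16=True,                         # native AMP mixed precision
)
```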

### Framework Versions

- Transformers 4.47.0.dev0
- PyTorch 2.4.1+cu121
- Datasets 3.0.2
- Tokenizers 0.20.1

## Citation

If you find this model helpful, please cite the following paper.

BibTeX:

```bibtex
@misc{hosseinbeigi2025advancingretrievalaugmentedgenerationpersian,
      title={Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization},
      author={Sara Bourbour Hosseinbeigi and Sina Asghari and Mohammad Ali Seif Kashani and Mohammad Hossein Shalchian and Mohammad Amin Abbasi},
      year={2025},
      eprint={2501.04858},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.04858},
}
```