---
license: cc-by-nc-4.0
language:
- fa
tags:
- masked-language-modeling
- feature-extraction
- large-scale-dataset
- Persian
- dataset_size:72.9B
- no-next-sentence-prediction
pipeline_tag: fill-mask
extra_gated_description: >-
You agree to not use the model to conduct experiments that cause harm to
human subjects.
extra_gated_fields:
Full Name: text
Organization (University): text
Email address: text
Country: country
Could you briefly explain the purpose of using the dataset?: text
I agree to use this dataset for non-commercial use ONLY: checkbox
---
# Persian Masked Language Model (MLM)
This model is a **Masked Language Model (MLM)** trained on a **72.9-billion-token corpus** of Persian text, making it one of the largest and most comprehensive models pre-trained exclusively for the Persian language. The model is designed to enhance **language understanding tasks** and provide high-quality contextual embeddings for various NLP applications in Persian.
- **Our Paper:** Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization [link](https://arxiv.org/abs/2501.04858)
## Model Details
### Model Description
- **Model Type:** Masked Language Model (MLM)
- **Base Model:** XLM-RoBERTa Large
- **Objective:** Predicting randomly masked tokens within sequences (see the sketch below)
- **Training Corpus Size:** 72.9 billion tokens
- **Maximum Sequence Length:** 512 tokens
- **Special Feature:** No Next Sentence Prediction (NSP) task
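As a rough illustration of the masking objective, the sketch below applies the standard 🤗 `DataCollatorForLanguageModeling` to a Persian sentence. The 15% masking probability is the library default, not a confirmed detail of this model's pre-training, and `"your_model_id"` is a placeholder for this repository's id:
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# "your_model_id" is a placeholder for this model's repository id.
tokenizer = AutoTokenizer.from_pretrained("your_model_id")

# mlm_probability=0.15 is the library default; the value actually used
# during pre-training is not stated in this card.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("زبان فارسی یکی از زبان‌های کهن جهان است.", return_tensors="pt")
batch = collator([{"input_ids": encoding["input_ids"][0]}])

# Random positions are replaced with the mask token; "labels" keeps the
# original ids at those positions and -100 everywhere else.
print(tokenizer.decode(batch["input_ids"][0]))
print(batch["labels"][0])
```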
## Training Details
### Training Configuration
- **Hardware:** 8 NVIDIA A800 GPUs
- **Duration:** One week
- **Optimization Framework:** DeepSpeed (Stage 0; a configuration sketch follows this list)
- **Training Parameters:**
- **Learning Rate:** 5e-5
- **Maximum Sequence Length:** 512 tokens
- **Precision:** FP16 (Mixed Precision)
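The DeepSpeed configuration file for this run is not published. Below is a minimal sketch of what a matching ZeRO Stage 0 / FP16 config could look like, written as a Python dict (🤗 `TrainingArguments` accepts a dict for its `deepspeed` argument in place of a JSON path):
```python
# Assumed shape only: mirrors the settings listed above (ZeRO Stage 0, FP16);
# the actual configuration used for training is not published with this card.
ds_config = {
    "zero_optimization": {"stage": 0},
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",  # resolved from TrainingArguments
    "gradient_accumulation_steps": "auto",
}
```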
### Corpus
The model was pre-trained on a large-scale corpus of Persian text collected from diverse sources, ensuring broad language coverage and contextual diversity:
- Web-crawled data
- Academic articles and books
- Persian Wikipedia
- Religious texts
- Social media platforms
The data underwent extensive preprocessing, including deduplication and noise removal, to ensure high-quality training data.
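The exact preprocessing pipeline is not published. As one hedged illustration of the deduplication step, exact duplicates can be dropped with a hash set over normalized text (a sketch, not the authors' pipeline; near-duplicate detection would need more, e.g. MinHash):
```python
import hashlib

def dedupe_exact(docs):
    """Drop exact duplicate documents by hashing whitespace-normalized text."""
    seen, unique = set(), []
    for doc in docs:
        # Collapse whitespace so trivially reformatted copies also match.
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["متن نمونه", "متن  نمونه", "متن دیگر"]
print(dedupe_exact(corpus))  # -> ['متن نمونه', 'متن دیگر']
```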
## Usage
The model can be used for various **downstream NLP tasks** in Persian, including:
- Text classification
- Named entity recognition
- Question answering
- Semantic search
- Contextual embedding generation
### Example Usage
This model can be loaded and used with the 🤗 Transformers library:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load tokenizer and model ("your_model_id" is a placeholder for this repo's id)
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModelForMaskedLM.from_pretrained("your_model_id")

# Example text: use the tokenizer's own mask token
# (for XLM-RoBERTa-based models this is "<mask>", not "[MASK]")
text = f"این یک {tokenizer.mask_token} جدید است."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the top prediction at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_index].argmax(dim=-1)))
```
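For embedding-based tasks such as semantic search, the encoder's hidden states can be pooled into sentence vectors. Below is a sketch; mean pooling over non-padding tokens is one common choice, not a method prescribed by this card:
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModel.from_pretrained("your_model_id")  # encoder without the MLM head

sentences = ["سلام دنیا", "پردازش زبان طبیعی"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, hidden_size])
```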
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (mapped to `TrainingArguments` in the sketch after this list):
- learning_rate: 5e-05
- train_batch_size: 30
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 480
- total_eval_batch_size: 64
- optimizer: AdamW (`adamw_torch`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP
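A hedged sketch of how these hyperparameters map onto 🤗 `TrainingArguments` (the authors' training script is not published; the output directory and any omitted details are assumptions):
```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="matina_mlm_out",      # hypothetical output directory
    learning_rate=5e-5,
    per_device_train_batch_size=30,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,    # 30 x 8 GPUs x 2 = 480 effective batch
    num_train_epochs=1.0,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,                        # native AMP mixed precision
)
```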
### Framework versions
- Transformers 4.47.0.dev0
- PyTorch 2.4.1+cu121
- Datasets 3.0.2
- Tokenizers 0.20.1
## Citation
If you find this model helpful, please cite the following paper.
**BibTeX:**
```bibtex
@misc{hosseinbeigi2025advancingretrievalaugmentedgenerationpersian,
title={Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization},
author={Sara Bourbour Hosseinbeigi and Sina Asghari and Mohammad Ali Seif Kashani and Mohammad Hossein Shalchian and Mohammad Amin Abbasi},
year={2025},
eprint={2501.04858},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.04858},
}
```