|
--- |
|
license: cc-by-nc-4.0 |
|
language: |
|
- fa |
|
tags: |
|
- masked-language-modeling |
|
- feature-extraction |
|
- large-scale-dataset |
|
- Persian |
|
- dataset_size:72.9B |
|
- no-next-sentence-prediction |
|
pipeline_tag: fill-mask |
|
|
|
extra_gated_description: >- |
|
You agree not to use the model to conduct experiments that cause harm to
|
human subjects. |
|
extra_gated_fields: |
|
Full Name: text |
|
Organization (University): text |
|
Email address: text |
|
Country: country |
|
Could you briefly explain the purpose of using the model?: text
|
I agree to use this model for non-commercial use ONLY: checkbox
|
--- |
|
|
|
# Persian Masked Language Model (MLM) |
|
|
|
This model is a **Masked Language Model (MLM)** trained on a **72.9-billion-token corpus** of Persian text, one of the largest corpora used to pre-train a model exclusively for Persian. It is designed to strengthen **language understanding tasks** and to provide high-quality contextual embeddings for a wide range of Persian NLP applications.
|
|
|
- **Our Paper:** [Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization](https://arxiv.org/abs/2501.04858)
|
|
|
## Model Details |
|
|
|
### Model Description |
|
- **Model Type:** Masked Language Model (MLM) |
|
- **Base Model:** XLM-RoBERTa Large |
|
- **Objective:** Predicting randomly masked tokens within sequences |
|
- **Training Corpus Size:** 72.9 billion tokens |
|
- **Maximum Sequence Length:** 512 tokens |
|
- **Special Feature:** No Next Sentence Prediction (NSP) task (the MLM-only objective is sketched below)
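
Because training uses only the masked-token objective, training batches can be produced with 🤗 Transformers' standard masking collator. The sketch below assumes the conventional 15% masking probability and uses the public `xlm-roberta-large` tokenizer as a stand-in for this model's tokenizer; neither detail is confirmed by this card.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

# MLM-only collator: masks random tokens; there is no NSP objective.
# mlm_probability=0.15 is the conventional default (assumed here).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# Build one masked training example from a Persian sentence
# ("The Persian language is beautiful.").
encoding = tokenizer("زبان فارسی زیباست", truncation=True, max_length=512)
batch = collator([encoding])
print(batch["input_ids"])  # some tokens replaced by <mask>
print(batch["labels"])     # -100 except at masked positions
```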
|
|
|
## Training Details |
|
|
|
### Training Configuration |
|
- **Hardware:** 8 NVIDIA A800 GPUs |
|
- **Duration:** One week |
|
- **Optimization Framework:** DeepSpeed (Stage 0; a config sketch follows this list)
|
- **Training Parameters:** |
|
- **Learning Rate:** 5e-5 |
|
- **Maximum Sequence Length:** 512 tokens |
|
- **Precision:** FP16 (Mixed Precision) |
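
The DeepSpeed configuration itself is not published. A minimal Stage 0 sketch consistent with the settings above, and with the hyperparameters reported later in this card, might look as follows; the keys are standard DeepSpeed options, but the assembled config is a reconstruction, not the authors' file:

```python
# Hypothetical DeepSpeed config matching the reported setup:
# ZeRO Stage 0 (no state partitioning) with FP16 mixed precision.
ds_config = {
    "train_micro_batch_size_per_gpu": 30,  # per-GPU batch size
    "gradient_accumulation_steps": 2,      # 30 * 8 GPUs * 2 = 480 effective
    "fp16": {"enabled": True},             # mixed-precision training
    "zero_optimization": {"stage": 0},     # Stage 0: ZeRO sharding disabled
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 5e-5, "betas": [0.9, 0.999], "eps": 1e-8},
    },
}
```

Such a dictionary can be passed directly as `TrainingArguments(deepspeed=ds_config)` when training with the 🤗 `Trainer`.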
|
|
|
### Corpus |
|
The model was pre-trained on a large-scale corpus of Persian text collected from diverse sources, ensuring broad language coverage and contextual diversity: |
|
- Web-crawled data |
|
- Academic articles and books |
|
- Persian Wikipedia |
|
- Religious texts |
|
- Social media platforms |
|
|
|
The data underwent extensive preprocessing, including deduplication and noise removal, to ensure high-quality training data. |
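
The exact cleaning pipeline is not described. As an illustration only, exact deduplication can be implemented by hashing normalized documents, as in this hypothetical sketch (large-scale pipelines typically add near-duplicate detection such as MinHash/LSH on top):

```python
import hashlib

def deduplicate(docs):
    """Drop exact duplicates by hashing whitespace-normalized text."""
    seen, unique = set(), []
    for doc in docs:
        # Normalize whitespace so trivial variants hash identically.
        digest = hashlib.sha1(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```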
|
|
|
## Usage |
|
|
|
The model can be used for various **downstream NLP tasks** in Persian, including: |
|
- Text classification |
|
- Named entity recognition |
|
- Question answering |
|
- Semantic search |
|
- Contextual embedding generation |
|
|
|
### Example Usage |
|
This model can be loaded and used with the 🤗 Transformers library: |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModelForMaskedLM.from_pretrained("your_model_id")

# Example text: "This is a new [MASK]." XLM-RoBERTa models use <mask>
# rather than [MASK], so insert the tokenizer's own mask token.
text = f"این یک {tokenizer.mask_token} جدید است."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the top prediction at the masked position
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_index].argmax(dim=-1)))
```
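
For the embedding-oriented tasks listed above, such as semantic search and contextual embedding generation, the encoder can also be loaded without the MLM head. A minimal sketch, assuming mean pooling over non-padding tokens (the card does not prescribe a pooling strategy):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModel.from_pretrained("your_model_id")

# Two Persian sentences: "The Persian language is beautiful." /
# "I am interested in natural language processing."
sentences = ["زبان فارسی زیباست", "من به پردازش زبان طبیعی علاقه دارم"]
inputs = tokenizer(sentences, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs.attention_mask.unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, 1024) for an XLM-R Large backbone
```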
|
|
|
## Training Procedure
|
|
|
### Training Hyperparameters
|
|
|
The following hyperparameters were used during training (a `TrainingArguments` reconstruction follows the list):
|
- learning_rate: 5e-05 |
|
- train_batch_size: 30 |
|
- eval_batch_size: 8 |
|
- seed: 42 |
|
- distributed_type: multi-GPU |
|
- num_devices: 8 |
|
- gradient_accumulation_steps: 2 |
|
- total_train_batch_size: 480 |
|
- total_eval_batch_size: 64 |
|
- optimizer: AdamW (`adamw_torch`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
|
- lr_scheduler_type: linear |
|
- num_epochs: 1.0 |
|
- mixed_precision_training: Native AMP |
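
These values map onto 🤗 Transformers' `TrainingArguments` roughly as follows. This is a hypothetical reconstruction for reference, not the authors' training script; the output path is a placeholder:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the reported hyperparameters;
# assumes an 8-GPU launch, e.g. `torchrun --nproc_per_node=8`.
args = TrainingArguments(
    output_dir="persian-mlm",        # placeholder output path
    learning_rate=5e-5,
    per_device_train_batch_size=30,  # 30 * 8 GPUs * 2 accum = 480 total
    per_device_eval_batch_size=8,    # 8 * 8 GPUs = 64 total
    gradient_accumulation_steps=2,
    num_train_epochs=1.0,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,                       # Native AMP mixed precision
    optim="adamw_torch",             # AdamW; betas and epsilon are defaults
)
```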
|
|
|
### Framework Versions
|
|
|
- Transformers 4.47.0.dev0 |
|
- PyTorch 2.4.1+cu121
|
- Datasets 3.0.2 |
|
- Tokenizers 0.20.1 |
|
|
|
## Citation |
|
If you find this model helpful, please cite the following paper:
|
|
|
**BibTeX:** |
|
```bibtex
|
@misc{hosseinbeigi2025advancingretrievalaugmentedgenerationpersian, |
|
title={Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization}, |
|
author={Sara Bourbour Hosseinbeigi and Sina Asghari and Mohammad Ali Seif Kashani and Mohammad Hossein Shalchian and Mohammad Amin Abbasi}, |
|
year={2025}, |
|
eprint={2501.04858}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2501.04858}, |
|
} |
|
``` |