---
language:
- fa
library_name: transformers
widget:
- text: "ز سوزناکی گفتار من [MASK] بگریست"
example_title: "Poetry 1"
- text: "نظر از تو برنگیرم همه [MASK] تا بمیرم که تو در دلم نشستی و سر مقام داری"
example_title: "Poetry 2"
- text: "هر ساعتم اندرون بجوشد [MASK] را وآگاهی نیست مردم بیرون را"
example_title: "Poetry 3"
- text: "غلام همت آن رند عافیت سوزم که در گدا صفتی [MASK] داند"
example_title: "Poetry 4"
- text: "این [MASK] اولشه."
example_title: "Informal 1"
- text: "دیگه خسته شدم! [MASK] اینم شد کار؟!"
example_title: "Informal 2"
- text: "فکر نکنم به موقع برسیم. بهتره [MASK] این یکی بشیم."
example_title: "Informal 3"
- text: "تا صبح بیدار موندم و داشتم برای [MASK] آماده می شدم."
example_title: "Informal 4"
- text: "زندگی بدون [MASK] خستهکننده است."
example_title: "Formal 1"
- text: "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد."
example_title: "Formal 2"
---
# FaBERT: Pre-training BERT on Persian Blogs
## Model Details
FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which covers both casual and formal Persian text. Across a range of Natural Language Understanding (NLU) tasks, FaBERT consistently delivers notable improvements while remaining compact. The model is available on Hugging Face and can be used with the `transformers` library without any extra setup.
## Features
- Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
- Strong performance across various downstream NLP tasks
- BERT architecture with 124 million parameters
## Useful Links
- **Repository:** [FaBERT on Github](https://github.com/SBU-NLP-LAB/FaBERT)
- **Paper:** [ACL Anthology](https://aclanthology.org/2025.wnut-1.10/)
## Usage
### Loading the Model with the MLM Head
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert") # make sure to use the default fast tokenizer
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
```
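As a quick sanity check, the MLM head can be queried through the `fill-mask` pipeline. The snippet below is a minimal sketch that reuses one of the widget examples above; the predicted tokens are whatever the model returns and are not taken from the paper.

```python
from transformers import pipeline

# Minimal sketch: query FaBERT's MLM head via the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="sbunlp/fabert")

# One of the widget examples above ("Life without [MASK] is tiring.")
for prediction in fill_mask("زندگی بدون [MASK] خسته‌کننده است."):
    print(prediction["token_str"], f"{prediction['score']:.4f}")
```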
### Downstream Tasks
Like the original English BERT, FaBERT can be [fine-tuned](https://huggingface.co/docs/transformers/en/training) on many downstream tasks. Examples on Persian datasets are available in our [GitHub repository](#useful-links); a minimal fine-tuning sketch is shown below.

**Make sure to use the default fast tokenizer.**
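The following is a hedged sketch of fine-tuning FaBERT for binary text classification with the `Trainer` API. The CSV file names, the `text`/`label` column names, and the hyperparameters are placeholders for illustration only, not the settings used in the paper.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Placeholder data: any CSV with "text" and "label" columns will do.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # default fast tokenizer
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="fabert-finetuned",    # placeholder output directory
    learning_rate=2e-5,               # illustrative hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pad dynamically per batch
)
trainer.train()
```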
## Training Details
FaBERT was pre-trained with the masked language modeling (MLM) objective using whole-word masking (WWM), reaching a perplexity of 7.76 on the validation set.
| Hyperparameter | Value |
|-------------------|:--------------:|
| Batch Size | 32 |
| Optimizer | Adam |
| Learning Rate | 6e-5 |
| Weight Decay | 0.01 |
| Total Steps | 18 Million |
| Warmup Steps | 1.8 Million |
| Precision Format | TF32 |
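For reference, the whole-word-masking objective described above roughly corresponds to the `DataCollatorForWholeWordMask` in `transformers`. The snippet below is only an illustration of how WWM groups sub-tokens; it is not the original pre-training code, and the example sentence is arbitrary.

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

# Illustration of MLM (WWM): all WordPiece sub-tokens of a randomly chosen
# word are masked together, instead of masking sub-tokens independently.
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # WordPiece ("##") tokenizer
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

encoding = tokenizer("زندگی بدون کتاب خسته‌کننده است.")
batch = collator([encoding])
print(batch["input_ids"][0])  # input ids with whole words replaced by [MASK]
print(batch["labels"][0])     # original ids at masked positions, -100 elsewhere
```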
## Evaluation
Here are some key performance results for the FaBERT model:
**Sentiment Analysis**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| MirasOpinion | **87.51** | 86.73 | 84.92 |
| MirasIrony | 74.82 | 71.08 | **75.51** |
| DeepSentiPers | **79.85** | 74.94 | 79.00 |
**Named Entity Recognition**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| PEYMA | **91.39** | 91.24 | 90.91 |
| ParsTwiner | **82.22** | 81.13 | 79.50 |
| MultiCoNER v2 | 57.92 | **58.09** | 51.47 |
**Question Answering**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| ParsiNLU | **55.87** | 44.89 | 42.55 |
| PQuAD | 87.34 | 86.89 | **87.60** |
| PCoQA | **53.51** | 50.96 | 51.12 |
**Natural Language Inference & QQP**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| FarsTail | **84.45** | 82.52 | 83.50 |
| SBU-NLI | **66.65** | 58.41 | 58.85 |
| ParsiNLU QQP | **82.62** | 77.60 | 79.74 |
**Number of Parameters**
| | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| Parameter Count (M) | 124 | 162 | 278 |
| Vocabulary Size (K) | 50 | 100 | 250 |
For a more detailed performance analysis, refer to the paper.
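The parameter count in the table above can be checked directly from the released checkpoint. This is a small sketch; the exact figure depends on whether task-specific heads are included.

```python
from transformers import AutoModel

# Count encoder parameters of the released checkpoint (expected around 124M).
model = AutoModel.from_pretrained("sbunlp/fabert")
total = sum(p.numel() for p in model.parameters())
print(f"FaBERT encoder parameters: {total / 1e6:.1f}M")
```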
## How to Cite
If you use FaBERT in your research or projects, please cite it using the following BibTeX:
```bibtex
@inproceedings{masumi-etal-2025-fabert,
title = "{F}a{BERT}: Pre-training {BERT} on {P}ersian Blogs",
author = "Masumi, Mostafa and
Majd, Seyed Soroush and
Shamsfard, Mehrnoush and
Beigy, Hamid",
editor = "Bak, JinYeong and
Goot, Rob van der and
Jang, Hyeju and
Buaphet, Weerayut and
Ramponi, Alan and
Xu, Wei and
Ritter, Alan",
booktitle = "Proceedings of the Tenth Workshop on Noisy and User-generated Text",
month = may,
year = "2025",
address = "Albuquerque, New Mexico, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.wnut-1.10/",
doi = "10.18653/v1/2025.wnut-1.10",
pages = "85--96",
ISBN = "979-8-89176-232-9",
}
```