FaBERT: Pre-training BERT on Persian Blogs

Model Details

FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which covers both casual and formal Persian text. Evaluated across a range of Natural Language Understanding (NLU) tasks, FaBERT consistently demonstrates notable improvements while maintaining a compact model size. The model is available on Hugging Face and can be integrated into Persian NLP projects without added complexity.

Features

  • Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
  • Remarkable performance across various downstream NLP tasks
  • BERT architecture with 124 million parameters

Useful Links

  • Paper: https://aclanthology.org/2025.wnut-1.10/

Usage

Loading the Model with MLM head

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert") # make sure to use the default fast tokenizer
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
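
For a quick check of the MLM head, the fill-mask pipeline wraps the model and tokenizer in a single call. The Persian prompt below ("Tehran is the capital of [MASK].") is only an illustrative sentence, not an example from the paper:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="sbunlp/fabert")

# "Tehran is the capital of [MASK]."
for prediction in fill_mask("تهران پایتخت [MASK] است."):
    print(prediction["token_str"], round(prediction["score"], 4))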

Downstream Tasks

Similar to the original English BERT, FaBERT can be fine-tuned on many downstream tasks; see the Hugging Face fine-tuning tutorial (https://huggingface.co/docs/transformers/en/training).

Examples on Persian datasets are available in our GitHub repository.

Make sure to use the default fast tokenizer.
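
As a rough outline of that recipe, the sketch below fine-tunes FaBERT for binary sentiment classification with the standard Trainer API. The tiny two-sentence dataset and the hyperparameters are illustrative assumptions only, not the settings used in the paper or in the GitHub examples.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # default fast tokenizer
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=2)

# Toy dataset for illustration only: two Persian sentences with arbitrary labels
# ("This movie was great" / "I am not satisfied with this product").
dataset = Dataset.from_dict({
    "text": ["این فیلم عالی بود", "از این محصول راضی نیستم"],
    "label": [1, 0],
}).train_test_split(test_size=0.5)

def tokenize(batch):
    # Truncate and pad to a fixed length; 128 tokens is an arbitrary choice.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="fabert-sentiment",
                         per_device_train_batch_size=16,
                         num_train_epochs=3,
                         learning_rate=2e-5)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["test"])
trainer.train()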

Training Details

FaBERT was pre-trained with the masked language modeling (MLM) objective using whole word masking (WWM), and the resulting perplexity on the validation set was 7.76.

Hyperparameter Value
Batch Size 32
Optimizer Adam
Learning Rate 6e-5
Weight Decay 0.01
Total Steps 18 Million
Warmup Steps 1.8 Million
Precision Format TF32
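
In whole word masking, when any sub-token of a word is selected for masking, all sub-tokens of that word are masked together. The sketch below shows how such training batches can be built with the DataCollatorForWholeWordMask utility from transformers; it is only an illustration of the objective, not the exact pre-training pipeline used for FaBERT.

from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")

# Mask roughly 15% of whole words: every sub-token of a chosen word is replaced together.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# "The Persian language is one of the Indo-European languages." (illustrative sentence)
encoding = tokenizer("زبان فارسی یکی از زبان‌های هند و اروپایی است.")
batch = collator([encoding])  # input_ids with [MASK] tokens plus MLM labels
print(batch["input_ids"].shape, batch["labels"].shape)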

Evaluation

Here are some key performance results for the FaBERT model:

Sentiment Analysis

Task FaBERT ParsBERT XLM-R
MirasOpinion 87.51 86.73 84.92
MirasIrony 74.82 71.08 75.51
DeepSentiPers 79.85 74.94 79.00

Named Entity Recognition

Task FaBERT ParsBERT XLM-R
PEYMA 91.39 91.24 90.91
ParsTwiner 82.22 81.13 79.50
MultiCoNER v2 57.92 58.09 51.47

Question Answering

Task FaBERT ParsBERT XLM-R
ParsiNLU 55.87 44.89 42.55
PQuAD 87.34 86.89 87.60
PCoQA 53.51 50.96 51.12

Natural Language Inference & QQP

Task FaBERT ParsBERT XLM-R
FarsTail 84.45 82.52 83.50
SBU-NLI 66.65 58.41 58.85
ParsiNLU QQP 82.62 77.60 79.74

Number of Parameters

FaBERT ParsBERT XLM-R
Parameter Count (M) 124 162 278
Vocabulary Size (K) 50 100 250
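
FaBERT's figures can be checked directly from the released checkpoint; a minimal sketch (exact totals depend on whether the MLM head is loaded):

from transformers import AutoModel

model = AutoModel.from_pretrained("sbunlp/fabert")

# Count the encoder parameters (without the MLM head) and read the vocabulary size.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")            # roughly 124M
print(f"vocabulary size: {model.config.vocab_size}")  # roughly 50K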

For a more detailed performance analysis, refer to the paper.

How to Cite

If you use FaBERT in your research or projects, please cite it using the following BibTeX:

@inproceedings{masumi-etal-2025-fabert,
    title = "{F}a{BERT}: Pre-training {BERT} on {P}ersian Blogs",
    author = "Masumi, Mostafa  and
      Majd, Seyed Soroush  and
      Shamsfard, Mehrnoush  and
      Beigy, Hamid",
    editor = "Bak, JinYeong  and
      Goot, Rob van der  and
      Jang, Hyeju  and
      Buaphet, Weerayut  and
      Ramponi, Alan  and
      Xu, Wei  and
      Ritter, Alan",
    booktitle = "Proceedings of the Tenth Workshop on Noisy and User-generated Text",
    month = may,
    year = "2025",
    address = "Albuquerque, New Mexico, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.wnut-1.10/",
    doi = "10.18653/v1/2025.wnut-1.10",
    pages = "85--96",
    ISBN = "979-8-89176-232-9",
}