mmbert-base-arabic-nli: Arabic Semantic Embedding Model Fine-Tuned with Triplet Data and GIST Distillation Loss


mmbert-base-arabic-nli is a high-quality Sentence Transformer model fine-tuned from jhu-clsp/mmBERT-base with GISTEmbedLoss (which uses a guide, or teacher, model to filter in-batch negatives) on the Arabic-Triplet-Matryoshka-V2 triplet dataset.

It maps Arabic sentences and paragraphs into a 768-dimensional semantic space, optimized for semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering tasks.

The model achieves a Spearman correlation of 0.8311 on STS benchmarks and supports sequences up to 8192 tokens, making it well suited to long-form Arabic understanding and retrieval-augmented generation (RAG) applications.
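
For context, GISTEmbedLoss trains the student encoder on (anchor, positive, negative) triplets while a frozen guide model filters out false negatives within each batch. Below is a minimal sketch of such a setup with the Sentence Transformers training API; the guide model choice, dataset repo path, and column layout are illustrative assumptions, not the exact training recipe used for this model.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import GISTEmbedLoss

# Student: the base encoder being fine-tuned
student = SentenceTransformer("jhu-clsp/mmBERT-base")

# Guide (teacher) model — an assumed stand-in; the actual guide is not documented here
guide = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Triplet dataset with (anchor, positive, negative) columns; repo path assumed
train_dataset = load_dataset(
    "Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2", split="train"
)

# The guide model scores in-batch pairs and suppresses likely false negatives
loss = GISTEmbedLoss(model=student, guide=guide)

trainer = SentenceTransformerTrainer(model=student, train_dataset=train_dataset, loss=loss)
trainer.train()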

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base Model: jhu-clsp/mmBERT-base
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
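
The Pooling block above indicates mean pooling over token embeddings. As a quick sanity check, you can load the model and confirm the sequence length and output dimensionality reported in the architecture:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/mmbert-base-arabic-nli")

print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 768
print(model)                                     # prints the module stack shown above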

Evaluation

MTEB Benchmark Comparison

Performance comparison on Arabic Semantic Textual Similarity (STS17, STS22-v2) benchmarks from the Massive Text Embedding Benchmark (MTEB).

Figure: MTEB STS17 vs. STS22-v2 results for mmbert-base-arabic-nli and the base mmBERT.

The fine-tuned mmbert-base-arabic-nli shows a marked improvement in semantic understanding on both MTEB benchmarks, gaining up to 30 points over the base mmBERT on cross-domain Arabic sentence similarity.
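
To reproduce scores on these benchmarks, the mteb package can run the STS tasks directly against a Sentence Transformer model. A minimal sketch, assuming the mteb library's task identifiers ("STS17", "STS22.v2") and language filter — verify the exact names against your installed mteb version:

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/mmbert-base-arabic-nli")

# Task names and language filter are assumptions; check mteb.get_tasks() output
tasks = mteb.get_tasks(tasks=["STS17", "STS22.v2"], languages=["ara"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="mteb_results")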


Semantic Similarity

Metric            Value
pearson_cosine    0.8259
spearman_cosine   0.8311
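
These cosine-based Pearson and Spearman scores are the standard output of Sentence Transformers' EmbeddingSimilarityEvaluator, which compares cosine similarities of sentence pairs against gold similarity labels. A minimal sketch with placeholder data (the sentence pairs and gold scores are illustrative; recent sentence-transformers versions return a metrics dict):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("Omartificial-Intelligence-Space/mmbert-base-arabic-nli")

# Placeholder evaluation pairs; a real STS set has hundreds of scored pairs
sentences1 = ["الطقس مشمس اليوم", "أحب القراءة"]
sentences2 = ["الجو صحو هذا اليوم", "القطط تحب النوم"]
gold_scores = [0.9, 0.1]  # human similarity labels in [0, 1]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="sts-dev")
results = evaluator(model)
print(results)  # includes keys such as 'sts-dev_pearson_cosine' and 'sts-dev_spearman_cosine'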

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then load the model and run inference:

from sentence_transformers import SentenceTransformer

# 🔹 Load the model from Hugging Face Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/mmbert-base-arabic-nli")

# 🔹 Example Arabic sentences
sentences = [
    "متوسط درجة الحرارة في أورلاندو، فلوريدا.",
    "تقع أورلاندو في وسط فلوريدا، ويبلغ متوسط درجة الحرارة الإجمالية فيها 83 درجة فهرنهايت، ومتوسط منخفض يبلغ 62 درجة.",
    "تتمتع جنوب فلوريدا بمناخ دافئ ورطب على مدار العام، مما يجعلها وجهة مثالية للعطلات."
]

# 🔹 Generate embeddings
embeddings = model.encode(sentences)
print("Shape of embeddings:", embeddings.shape)
# Output: (3, 768)

# 🔹 Compute pairwise cosine similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([
#   [1.0000, 0.8495, 0.7115],
#   [0.8495, 1.0000, 0.7436],
#   [0.7115, 0.7436, 1.0000]
# ])
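
Since the model targets semantic search and RAG, here is an additional sketch using util.semantic_search to retrieve the closest corpus entries for a query, reusing the model and the three sentences above as a toy corpus (the query string is illustrative):

from sentence_transformers import util

query = "كم يبلغ متوسط درجة الحرارة في أورلاندو؟"  # "What is the average temperature in Orlando?"
query_embedding = model.encode(query)
corpus_embeddings = model.encode(sentences)  # reuse the three sentences above as a corpus

# Returns the top-k corpus indices ranked by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(sentences[hit["corpus_id"]], round(hit["score"], 4))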

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

GISTEmbedLoss

@misc{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    year={2024},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}