mmbert-base-arabic-nli: Arabic Semantic Embedding Model Fine-Tuned with GISTEmbedLoss on Arabic Triplets
mmbert-base-arabic-nli is a Sentence Transformer model fine-tuned from jhu-clsp/mmBERT-base on the Arabic-Triplet-Matryoshka-V2 dataset using GISTEmbedLoss.
It maps Arabic sentences and paragraphs into a 768-dimensional semantic space, optimized for semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
The model achieves a Spearman cosine correlation of 0.8311 on the STS dev set and supports sequences up to 8192 tokens, making it well suited to long-form Arabic understanding and retrieval-augmented generation (RAG) applications.
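For reference, fine-tuning with GISTEmbedLoss follows the standard Sentence Transformers v3 training loop, in which a guide model filters out in-batch false negatives. The sketch below is a hedged reconstruction, not the exact training script: the guide-model choice and the dataset Hub path are illustrative assumptions.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import GISTEmbedLoss
# Base model to fine-tune
model = SentenceTransformer("jhu-clsp/mmBERT-base")
# Guide model used by GISTEmbedLoss to discard in-batch false negatives (illustrative choice)
guide = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
# Triplet dataset with (anchor, positive, negative) columns (assumed Hub path)
train_dataset = load_dataset("Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2", split="train")
loss = GISTEmbedLoss(model=model, guide=guide)
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()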
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: jhu-clsp/mmBERT-base
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
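The pooling layer above performs plain mean pooling over token embeddings. As a sanity check, the same embedding can be reproduced with raw transformers; this is a minimal sketch assuming a transformers version that supports the ModernBERT architecture:
import torch
from transformers import AutoModel, AutoTokenizer
name = "Omartificial-Intelligence-Space/mmbert-base-arabic-nli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
# "A test sentence."
batch = tokenizer(["جملة تجريبية."], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, 768)
# Mean pooling: average token embeddings, ignoring padding positions
mask = batch["attention_mask"].unsqueeze(-1).float()
embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 768])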
Evaluation
MTEB Benchmark Comparison
Performance comparison on Arabic Semantic Textual Similarity (STS17, STS22-v2) benchmarks from the Massive Text Embedding Benchmark (MTEB).
The fine-tuned mmbert-base-arabic-nli shows a significant improvement in semantic understanding across both MTEB benchmarks, gaining up to 30 points over the base mmBERT on cross-domain Arabic sentence similarity.
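The comparison can be reproduced with the mteb package; a minimal sketch using the classic string-based task API (the task names for the Arabic subsets are assumptions against the MTEB registry, and newer mteb releases use mteb.get_tasks instead):
from mteb import MTEB
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Omartificial-Intelligence-Space/mmbert-base-arabic-nli")
# Restrict the multilingual STS tasks to their Arabic subsets
evaluation = MTEB(tasks=["STS17", "STS22.v2"], task_langs=["ar"])
evaluation.run(model, output_folder="mteb_results")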
Semantic Similarity
- Dataset: sts-dev
- Evaluated with: EmbeddingSimilarityEvaluator
| Metric | Value |
|---|---|
| pearson_cosine | 0.8259 |
| spearman_cosine | 0.8311 |
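A hedged sketch of how such an evaluation is run with EmbeddingSimilarityEvaluator; the sentence pairs and gold scores below are toy stand-ins for the actual sts-dev split:
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
model = SentenceTransformer("Omartificial-Intelligence-Space/mmbert-base-arabic-nli")
# Toy STS-style pairs with gold similarity scores in [0, 1]:
# ("The cat sits on the rug." / "There is a cat on the rug.")
# ("The sky is clear today." / "I love eating pizza.")
sentences1 = ["القطة تجلس على السجادة.", "السماء صافية اليوم."]
sentences2 = ["هناك قطة فوق السجادة.", "أحب تناول البيتزا."]
scores = [0.9, 0.1]
evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, scores, name="sts-dev")
print(evaluator(model))  # reports pearson_cosine, spearman_cosine, etc.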
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then load the model and run inference:
from sentence_transformers import SentenceTransformer
# 🔹 Load the model from Hugging Face Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/mmbert-base-arabic-nli")
# 🔹 Example Arabic sentences
sentences = [
    # "Average temperature in Orlando, Florida."
    "متوسط درجة الحرارة في أورلاندو، فلوريدا.",
    # "Orlando sits in central Florida, with an overall average temperature of 83°F and an average low of 62°F."
    "تقع أورلاندو في وسط فلوريدا، ويبلغ متوسط درجة الحرارة الإجمالية فيها 83 درجة فهرنهايت، ومتوسط منخفض يبلغ 62 درجة.",
    # "South Florida has a warm, humid climate year-round, making it an ideal vacation destination."
    "تتمتع جنوب فلوريدا بمناخ دافئ ورطب على مدار العام، مما يجعلها وجهة مثالية للعطلات."
]
# 🔹 Generate embeddings
embeddings = model.encode(sentences)
print("Shape of embeddings:", embeddings.shape)
# Output: (3, 768)
# 🔹 Compute pairwise cosine similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([
# [1.0000, 0.8495, 0.7115],
# [0.8495, 1.0000, 0.7436],
# [0.7115, 0.7436, 1.0000]
# ])
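Since the model targets semantic search and RAG, retrieval over a corpus is the natural next step; a minimal sketch with the library's built-in util.semantic_search (the corpus and query are illustrative):
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("Omartificial-Intelligence-Space/mmbert-base-arabic-nli")
# Toy corpus: "Orlando's average temperature is 83 degrees Fahrenheit." / "Pizza is a popular Italian dish."
corpus = [
    "متوسط درجة الحرارة في أورلاندو 83 درجة فهرنهايت.",
    "البيتزا طبق إيطالي شهير.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
# Query: "What is the weather like in Orlando?"
query_embedding = model.encode("كيف هو الطقس في أورلاندو؟", convert_to_tensor=True)
# Cosine-similarity top-k retrieval
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
print(hits[0])  # e.g. [{'corpus_id': 0, 'score': ...}]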
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
GISTEmbedLoss
@misc{solatorio2024gistembed,
title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
author={Aivin V. Solatorio},
year={2024},
eprint={2402.16829},
archivePrefix={arXiv},
primaryClass={cs.LG}
}