---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:1000000
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
- arabic
- Semantic
base_model: google/embeddinggemma-300m
widget:
- source_sentence: امرأة شقراء تطل على مشهد (سياتل سبيس نيدل)
  sentences:
  - رجل يستمتع بمناظر جسر البوابة الذهبية
  - فتاة بالخارج تلعب في الثلج
  - شخص ما يأخذ في نظرة إبرة الفضاء.
- source_sentence: سوق الشرق الأوسط
  sentences:
  - مسرح أمريكي
  - متجر في الشرق الأوسط
  - البالغون صغار
- source_sentence: رجلين يتنافسان في ملابس فنون الدفاع عن النفس
  sentences:
  - هناك العديد من الناس الحاضرين.
  - الكلب الأبيض على الشاطئ
  - هناك شخص واحد فقط موجود.
- source_sentence: مجموعة من الناس تمشي بجانب شاحنة.
  sentences:
  - الناس يقفون
  - بعض الناس بالخارج
  - بعض الرجال يقودون على الطريق
- source_sentence: لاعبة كرة ناعمة ترمي الكرة إلى زميلتها في الفريق
  sentences:
  - شخصان يلعبان كرة البيسبول
  - الرجل ينظف
  - لاعبين لكرة البيسبول يجلسان على مقعد
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
language:
- ar
---

# AraGemma-Embedding-300m

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/bdBAW2_I9e_-pXzStqT6i.png)

**Model Page**: [AraGemma-Embedding (Hugging Face)](https://huggingface.co/Omartificial-Intelligence-Space/AraGemma-Embedding-300m)

**Authors**: [Google DeepMind](https://deepmind.google) (base model), fine-tuned by [Omartificial-Intelligence-Space](https://huggingface.co/Omartificial-Intelligence-Space)

**Find out more**: [Arabic Semantic Embedding Models](https://www.omarai.me/embeddings)

---

## Simple RAG and Other NLP Tasks

Example notebook: [RAG & NLP Tasks Notebook](https://colab.research.google.com/drive/1Tmk5XgbvsFOIDYH7M9TwsaQ40qwP4h_O?usp=sharing)

---

## Model Overview

**AraGemma-Embedding-300m** is a fine-tuned version of **[EmbeddingGemma-300M](https://ai.google.dev/gemma/docs/embeddinggemma)**, optimized for **Arabic semantic understanding**.

The model was fine-tuned on **1 million Arabic triplets** (anchor, positive, negative) with **Matryoshka Representation Learning (MRL)** to improve semantic similarity, clustering, classification, and retrieval for Arabic texts.

It builds on Google's Gemma 3 research, making it lightweight, efficient, and deployable on-device (mobile, laptops, desktops) while achieving **state-of-the-art Arabic semantic embedding performance**.

---

## Model Information

### Input

- Text string (Arabic or multilingual)
- Maximum context length: **2048 tokens**

### Output

- Dense vector representation of size **768**
- Supports **MRL truncation** to 512, 256, or 128 dimensions with re-normalization (see the sketch below)
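MRL truncation amounts to slicing off the leading coordinates of the 768-dimensional vector and re-normalizing the result. The snippet below is a minimal sketch of that procedure, not an official API; the Arabic sentence is one of the examples reused from the Usage section and the choice of 256 dimensions is arbitrary.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

# Full 768-dimensional embedding (illustrative sentence: "Mars is called the Red Planet ...")
embedding = model.encode("المريخ يسمى بالكوكب الأحمر بسبب لونه المميز.")

# MRL places the most informative coordinates first, so keep the leading `dim`
# values and re-normalize to unit length so cosine similarity stays meaningful.
dim = 256  # any of 512, 256, or 128
truncated = embedding[:dim]
truncated = truncated / np.linalg.norm(truncated)

print(truncated.shape)  # (256,)
```

Recent versions of Sentence Transformers also accept a `truncate_dim` argument on the `SentenceTransformer` constructor that performs the same truncation; the manual slice above simply makes the re-normalization step explicit.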
---

## Performance

### Benchmark Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/kEkJ41nyV3CfhLT00_lfP.png)

The results show significant improvements, reflecting stronger **Arabic semantic understanding**.

### Comparison with Other Arabic Embedding Models

| Model | Dim | # Params. | STS17 | STS22-v2 | Average |
|------------------------------------------|------|-----------|-------|----------|---------|
| Arabic-Triplet-Matryoshka-V2 | 768 | 135M | 85 | 64 | 75 |
| Arabert-all-nli-triplet-Matryoshka | 768 | 135M | 83 | 64 | 74 |
| **AraGemma-Embedding-300m** | 768 | 303M | 84 | 62 | 73 |
| GATE-AraBert-V1 | 767 | 135M | 83 | 63 | 73 |
| Marbert-all-nli-triplet-Matryoshka | 768 | 163M | 82 | 61 | 72 |
| Arabic-labse-Matryoshka | 768 | 471M | 82 | 61 | 72 |
| AraEuroBert-Small | 768 | 210M | 80 | 61 | 71 |
| E5-all-nli-triplet-Matryoshka | 384 | 278M | 80 | 60 | 70 |
| text-embedding-3-large | 3072 | - | 81 | 59 | 70 |
| Arabic-all-nli-triplet-Matryoshka | 768 | 135M | 82 | 54 | 68 |
| AraEuroBert-Mid | 1151 | 610M | 83 | 53 | 68 |
| paraphrase-multilingual-mpnet-base-v2 | 768 | 135M | 79 | 55 | 67 |
| AraEuroBert-Large | 2304 | 2.1B | 79 | 55 | 67 |
| text-embedding-ada-002 | 1536 | - | 71 | 62 | 66 |
| text-embedding-3-small | 1536 | - | 72 | 57 | 65 |

---

## Usage

This model is compatible with [Sentence Transformers](https://www.SBERT.net) and Hugging Face [Transformers](https://huggingface.co/docs/transformers/en/index).

```python
import torch
from sentence_transformers import SentenceTransformer

# Load the Arabic-optimized embedding model
model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

# Example: Arabic semantic similarity
query = "ما هو الكوكب الأحمر؟"
documents = [
    "الزهرة تشبه الأرض في الحجم والقرب.",
    "المريخ يسمى بالكوكب الأحمر بسبب لونه المميز.",
    "المشتري أكبر كواكب المجموعة الشمسية.",
    "زحل يتميز بحلقاته الشهيرة.",
]

# Encode the query and the candidate documents
query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)

# Compute cosine similarity between the query and each document
query_tensor = torch.tensor(query_embedding)
doc_tensors = torch.tensor(doc_embeddings)

similarities = torch.cosine_similarity(query_tensor.unsqueeze(0), doc_tensors)
print(similarities)
```

## Applications

- Semantic chunking for RAG (Retrieval-Augmented Generation)
- Semantic search and retrieval (Arabic focus)
- Clustering and classification of Arabic documents
- Cross-lingual retrieval (multilingual data supported)

## Limitations

- Embedding activations do not support float16 – use float32 or bfloat16.

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{AraGemmaEmbedding2025,
  title={AraGemma-Embedding: Fine-tuned EmbeddingGemma for Arabic Semantic Understanding},
  author={Omartificial-Intelligence-Space},
  year={2025},
  url={https://huggingface.co/Omartificial-Intelligence-Space/AraGemma-Embedding-300m}
}
```