---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:1000000
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
  - arabic
  - Semantic
base_model: google/embeddinggemma-300m
widget:
  - source_sentence: امرأة شقراء تطل على مشهد (سياتل سبيس نيدل)
    sentences:
      - رجل يستمتع بمناظر جسر البوابة الذهبية
      - فتاة بالخارج تلعب في الثلج
      - شخص ما يأخذ في نظرة إبرة الفضاء.
  - source_sentence: سوق الشرق الأوسط
    sentences:
      - مسرح أمريكي
      - متجر في الشرق الأوسط
      - البالغون صغار
  - source_sentence: رجلين يتنافسان في ملابس فنون الدفاع عن النفس
    sentences:
      - هناك العديد من الناس الحاضرين.
      - الكلب الأبيض على الشاطئ
      - هناك شخص واحد فقط موجود.
  - source_sentence: مجموعة من الناس تمشي بجانب شاحنة.
    sentences:
      - الناس يقفون
      - بعض الناس بالخارج
      - بعض الرجال يقودون على الطريق
  - source_sentence: لاعبة كرة ناعمة ترمي الكرة إلى زميلتها في الفريق
    sentences:
      - شخصان يلعبان كرة البيسبول
      - الرجل ينظف
      - لاعبين لكرة البيسبول يجلسان على مقعد
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
language:
  - ar
---

# AraGemma-Embedding-300m


**Model Page:** AraGemma-Embedding (Hugging Face)

**Authors:** Google DeepMind (base model); fine-tuned by Omartificial-Intelligence-Space

**See also:** Arabic Semantic Embeddings Models

**Example notebook (simple RAG and other NLP tasks):** RAG & NLP Tasks Notebook


## Model Overview

AraGemma-Embedding-300m is a fine-tuned version of EmbeddingGemma-300M, optimized for Arabic semantic understanding. It was trained on 1 million Arabic triplets (anchor, positive, negative) with Matryoshka Representation Learning (MRL) to improve semantic similarity, clustering, classification, and retrieval for Arabic text.

It builds on Google’s Gemma 3 research, making it lightweight, efficient, and deployable on-device (mobile, laptops, desktops) while delivering highly competitive Arabic semantic embedding performance.
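
The model tags name the training objective: MatryoshkaLoss wrapping MultipleNegativesRankingLoss over (anchor, positive, negative) triplets. The following is a minimal sketch of that setup, not the actual training script; the dataset name and all hyperparameters are placeholders:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("google/embeddinggemma-300m")

# Placeholder dataset with anchor/positive/negative columns (hypothetical name).
train_dataset = load_dataset("your-org/arabic-triplets", split="train")

# MNRL treats the other in-batch positives as extra negatives;
# MatryoshkaLoss applies it at each nested embedding size.
inner_loss = losses.MultipleNegativesRankingLoss(model)
loss = losses.MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128])

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```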


## Model Information

### Input

- Text string (Arabic or multilingual)
- Maximum context length: 2048 tokens

### Output

- Dense vector representation of size 768
- Supports MRL truncation to 512, 256, or 128 dimensions with re-normalization (see the sketch below)
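
A minimal sketch of MRL truncation with re-normalization. The slice-and-renormalize step is the standard Matryoshka recipe; `truncate_dim` is the Sentence Transformers shortcut for the same effect:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")
embeddings = model.encode(["مثال على جملة عربية"])  # "An example Arabic sentence"; shape (1, 768)

# Keep the first 256 Matryoshka dimensions, then re-normalize to unit length.
dim = 256
truncated = embeddings[:, :dim]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Equivalent shortcut: fix the truncation dimension at load time.
# model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m", truncate_dim=256)
```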

## Performance

### Benchmark Results

The benchmark results show significant improvements over the base EmbeddingGemma-300M, reflecting stronger Arabic semantic understanding.

### Comparison with Other Arabic Embedding Models

| Model | Dim | # Params | STS17 | STS22-v2 | Average |
|---|---|---|---|---|---|
| Arabic-Triplet-Matryoshka-V2 | 768 | 135M | 85 | 64 | 75 |
| Arabert-all-nli-triplet-Matryoshka | 768 | 135M | 83 | 64 | 74 |
| **AraGemma-Embedding-300m** | 768 | 303M | 84 | 62 | 73 |
| GATE-AraBert-V1 | 767 | 135M | 83 | 63 | 73 |
| Marbert-all-nli-triplet-Matryoshka | 768 | 163M | 82 | 61 | 72 |
| Arabic-labse-Matryoshka | 768 | 471M | 82 | 61 | 72 |
| AraEuroBert-Small | 768 | 210M | 80 | 61 | 71 |
| E5-all-nli-triplet-Matryoshka | 384 | 278M | 80 | 60 | 70 |
| text-embedding-3-large | 3072 | - | 81 | 59 | 70 |
| Arabic-all-nli-triplet-Matryoshka | 768 | 135M | 82 | 54 | 68 |
| AraEuroBert-Mid | 1151 | 610M | 83 | 53 | 68 |
| paraphrase-multilingual-mpnet-base-v2 | 768 | 135M | 79 | 55 | 67 |
| AraEuroBert-Large | 2304 | 2.1B | 79 | 55 | 67 |
| text-embedding-ada-002 | 1536 | - | 71 | 62 | 66 |
| text-embedding-3-small | 1536 | - | 72 | 57 | 65 |

## Usage

This model is compatible with Sentence Transformers and Hugging Face Transformers.

```python
import torch
from torch import cosine_similarity
from sentence_transformers import SentenceTransformer

# Load the Arabic-optimized embedding model
model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

# Example: Arabic semantic similarity
query = "ما هو الكوكب الأحمر؟"  # "What is the red planet?"
documents = [
    "الزهرة تشبه الأرض في الحجم والقرب.",  # "Venus resembles Earth in size and proximity."
    "المريخ يسمى بالكوكب الأحمر بسبب لونه المميز.",  # "Mars is called the red planet because of its distinctive color."
    "المشتري أكبر كواكب المجموعة الشمسية.",  # "Jupiter is the largest planet in the solar system."
    "زحل يتميز بحلقاته الشهيرة.",  # "Saturn is famous for its rings."
]

query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)

# Compute cosine similarity between the query and each document
query_tensor = torch.tensor(query_embedding)
doc_tensors = torch.tensor(doc_embeddings)
similarities = cosine_similarity(query_tensor.unsqueeze(0), doc_tensors)

print(similarities)  # the Mars sentence should score highest
```

## Applications

- Semantic chunking for RAG (Retrieval-Augmented Generation); see the sketch after this list
- Semantic search & retrieval (Arabic focus)
- Clustering and classification of Arabic documents
- Cross-lingual retrieval (multilingual data supported)
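
One common approach to semantic chunking (an illustrative sketch, not a prescribed pipeline) is to group consecutive sentences until the similarity between adjacent sentence embeddings drops below a threshold; the `threshold` value here is a hypothetical starting point:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Group consecutive sentences, splitting where adjacent similarity drops."""
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Dot product of unit vectors equals cosine similarity.
        if float(np.dot(embeddings[i - 1], embeddings[i])) >= threshold:
            current.append(sentences[i])
        else:
            chunks.append(current)
            current = [sentences[i]]
    chunks.append(current)
    return chunks
```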

## Limitations

- Embedding activations do not support float16; use float32 or bfloat16 (see the loading example below).
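
One way to honor this is to request bfloat16 weights at load time; passing `torch_dtype` through `model_kwargs` is standard Sentence Transformers usage (whether you need it depends on your hardware):

```python
import torch
from sentence_transformers import SentenceTransformer

# Load with bfloat16 weights rather than float16, per the limitation above.
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/AraGemma-Embedding-300m",
    model_kwargs={"torch_dtype": torch.bfloat16},
)
```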

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{AraGemmaEmbedding2025,
  title={AraGemma-Embedding: Fine-tuned EmbeddingGemma for Arabic Semantic Understanding},
  author={Omartificial-Intelligence-Space},
  year={2025},
  url={https://huggingface.co/Omartificial-Intelligence-Space/AraGemma-Embedding-300m}
}
```