---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:1000000
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
  - arabic
  - Semantic
base_model: google/embeddinggemma-300m
widget:
  - source_sentence: امرأة شقراء تطل على مشهد (سياتل سبيس نيدل)
    sentences:
      - رجل يستمتع بمناظر جسر البوابة الذهبية
      - فتاة بالخارج تلعب في الثلج
      - شخص ما يأخذ في نظرة إبرة الفضاء.
  - source_sentence: سوق الشرق الأوسط
    sentences:
      - مسرح أمريكي
      - متجر في الشرق الأوسط
      - البالغون صغار
  - source_sentence: رجلين يتنافسان في ملابس فنون الدفاع عن النفس
    sentences:
      - هناك العديد من الناس الحاضرين.
      - الكلب الأبيض على الشاطئ
      - هناك شخص واحد فقط موجود.
  - source_sentence: مجموعة من الناس تمشي بجانب شاحنة.
    sentences:
      - الناس يقفون
      - بعض الناس بالخارج
      - بعض الرجال يقودون على الطريق
  - source_sentence: لاعبة كرة ناعمة ترمي الكرة إلى زميلتها في الفريق
    sentences:
      - شخصان يلعبان كرة البيسبول
      - الرجل ينظف
      - لاعبين لكرة البيسبول يجلسان على مقعد
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
language:
  - ar
---

# AraGemma-Embedding-300m


**Model Page:** AraGemma-Embedding (Hugging Face)

**Authors:** Google DeepMind (base model); fine-tuned by Omartificial-Intelligence-Space

**See also:** Arabic Semantic Embeddings Models

**Example notebook (simple RAG and other NLP tasks):** RAG & NLP Tasks Notebook


## Model Overview

AraGemma-Embedding-300m is a fine-tuned version of EmbeddingGemma-300M, optimized for Arabic semantic understanding. It was trained on 1 million Arabic triplets (anchor, positive, negative) with Matryoshka Representation Learning (MRL) to improve semantic similarity, clustering, classification, and retrieval for Arabic text.

It builds on Google’s Gemma 3 research, making it lightweight, efficient, and deployable on-device (mobile, laptops, desktops) while delivering highly competitive Arabic semantic embedding performance.
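
The model tags name the training objective: MatryoshkaLoss wrapping MultipleNegativesRankingLoss over (anchor, positive, negative) triplets. The following is a minimal sketch of that setup, not the actual training script; the dataset name and all hyperparameters are placeholders:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("google/embeddinggemma-300m")

# Placeholder dataset with anchor/positive/negative columns (hypothetical name).
train_dataset = load_dataset("your-org/arabic-triplets", split="train")

# MNRL treats the other in-batch positives as extra negatives;
# MatryoshkaLoss applies it at each nested embedding size.
inner_loss = losses.MultipleNegativesRankingLoss(model)
loss = losses.MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128])

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```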


## Model Information

### Input

- Text string (Arabic or multilingual)
- Maximum context length: 2048 tokens

### Output

- Dense vector representation of size 768
- Supports MRL truncation to 512, 256, or 128 dimensions with re-normalization (see the sketch below)
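
A minimal sketch of MRL truncation with re-normalization. The slice-and-renormalize step is the standard Matryoshka recipe; `truncate_dim` is the Sentence Transformers shortcut for the same effect:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")
embeddings = model.encode(["مثال على جملة عربية"])  # "An example Arabic sentence"; shape (1, 768)

# Keep the first 256 Matryoshka dimensions, then re-normalize to unit length.
dim = 256
truncated = embeddings[:, :dim]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Equivalent shortcut: fix the truncation dimension at load time.
# model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m", truncate_dim=256)
```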

## Performance

### Benchmark Results

The benchmark results show significant improvements over the base EmbeddingGemma-300M, reflecting stronger Arabic semantic understanding.

### Comparison with Other Arabic Embedding Models

| Model | Dim | # Params | STS17 | STS22-v2 | Average |
|---|---|---|---|---|---|
| Arabic-Triplet-Matryoshka-V2 | 768 | 135M | 85 | 64 | 75 |
| Arabert-all-nli-triplet-Matryoshka | 768 | 135M | 83 | 64 | 74 |
| **AraGemma-Embedding-300m** | 768 | 303M | 84 | 62 | 73 |
| GATE-AraBert-V1 | 767 | 135M | 83 | 63 | 73 |
| Marbert-all-nli-triplet-Matryoshka | 768 | 163M | 82 | 61 | 72 |
| Arabic-labse-Matryoshka | 768 | 471M | 82 | 61 | 72 |
| AraEuroBert-Small | 768 | 210M | 80 | 61 | 71 |
| E5-all-nli-triplet-Matryoshka | 384 | 278M | 80 | 60 | 70 |
| text-embedding-3-large | 3072 | - | 81 | 59 | 70 |
| Arabic-all-nli-triplet-Matryoshka | 768 | 135M | 82 | 54 | 68 |
| AraEuroBert-Mid | 1151 | 610M | 83 | 53 | 68 |
| paraphrase-multilingual-mpnet-base-v2 | 768 | 135M | 79 | 55 | 67 |
| AraEuroBert-Large | 2304 | 2.1B | 79 | 55 | 67 |
| text-embedding-ada-002 | 1536 | - | 71 | 62 | 66 |
| text-embedding-3-small | 1536 | - | 72 | 57 | 65 |

## Usage

This model is compatible with Sentence Transformers and Hugging Face Transformers.

```python
import torch
from torch import cosine_similarity
from sentence_transformers import SentenceTransformer

# Load the Arabic-optimized embedding model
model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

# Example: Arabic semantic similarity
query = "ما هو الكوكب الأحمر؟"  # "What is the red planet?"
documents = [
    "الزهرة تشبه الأرض في الحجم والقرب.",  # "Venus resembles Earth in size and proximity."
    "المريخ يسمى بالكوكب الأحمر بسبب لونه المميز.",  # "Mars is called the red planet because of its distinctive color."
    "المشتري أكبر كواكب المجموعة الشمسية.",  # "Jupiter is the largest planet in the solar system."
    "زحل يتميز بحلقاته الشهيرة.",  # "Saturn is famous for its rings."
]

query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)

# Compute cosine similarity between the query and each document
query_tensor = torch.tensor(query_embedding)
doc_tensors = torch.tensor(doc_embeddings)
similarities = cosine_similarity(query_tensor.unsqueeze(0), doc_tensors)

print(similarities)  # the Mars sentence should score highest
```

## Applications

- Semantic chunking for RAG (Retrieval-Augmented Generation); see the sketch after this list
- Semantic search & retrieval (Arabic focus)
- Clustering and classification of Arabic documents
- Cross-lingual retrieval (multilingual data supported)
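
One common approach to semantic chunking (an illustrative sketch, not a prescribed pipeline) is to group consecutive sentences until the similarity between adjacent sentence embeddings drops below a threshold; the `threshold` value here is a hypothetical starting point:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Group consecutive sentences, splitting where adjacent similarity drops."""
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Dot product of unit vectors equals cosine similarity.
        if float(np.dot(embeddings[i - 1], embeddings[i])) >= threshold:
            current.append(sentences[i])
        else:
            chunks.append(current)
            current = [sentences[i]]
    chunks.append(current)
    return chunks
```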

## Limitations

- Embedding activations do not support float16; use float32 or bfloat16 (see the loading example below).
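
One way to honor this is to request bfloat16 weights at load time; passing `torch_dtype` through `model_kwargs` is standard Sentence Transformers usage (whether you need it depends on your hardware):

```python
import torch
from sentence_transformers import SentenceTransformer

# Load with bfloat16 weights rather than float16, per the limitation above.
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/AraGemma-Embedding-300m",
    model_kwargs={"torch_dtype": torch.bfloat16},
)
```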

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{AraGemmaEmbedding2025,
  title={AraGemma-Embedding: Fine-tuned EmbeddingGemma for Arabic Semantic Understanding},
  author={Omartificial-Intelligence-Space},
  year={2025},
  url={https://huggingface.co/Omartificial-Intelligence-Space/AraGemma-Embedding-300m}
}
```