---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:1000000
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
- arabic
- Semantic
base_model: google/embeddinggemma-300m
widget:
- source_sentence: امرأة شقراء تطل على مشهد (سياتل سبيس نيدل)
  sentences:
  - رجل يستمتع بمناظر جسر البوابة الذهبية
  - فتاة بالخارج تلعب في الثلج
  - شخص ما يأخذ في نظرة إبرة الفضاء.
- source_sentence: سوق الشرق الأوسط
  sentences:
  - مسرح أمريكي
  - متجر في الشرق الأوسط
  - البالغون صغار
- source_sentence: رجلين يتنافسان في ملابس فنون الدفاع عن النفس
  sentences:
  - هناك العديد من الناس الحاضرين.
  - الكلب الأبيض على الشاطئ
  - هناك شخص واحد فقط موجود.
- source_sentence: مجموعة من الناس تمشي بجانب شاحنة.
  sentences:
  - الناس يقفون
  - بعض الناس بالخارج
  - بعض الرجال يقودون على الطريق
- source_sentence: لاعبة كرة ناعمة ترمي الكرة إلى زميلتها في الفريق
  sentences:
  - شخصان يلعبان كرة البيسبول
  - الرجل ينظف
  - لاعبين لكرة البيسبول يجلسان على مقعد
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
language:
- ar
---

# AraGemma-Embedding-300m

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/bdBAW2_I9e_-pXzStqT6i.png)

**Model Page**: [AraGemma-Embedding (Hugging Face)](https://huggingface.co/Omartificial-Intelligence-Space/AraGemma-Embedding-300m)

**Authors**: [Google DeepMind](https://deepmind.google) (base model), fine-tuned by [Omartificial-Intelligence-Space](https://huggingface.co/Omartificial-Intelligence-Space)

**Find out more**: [Arabic Semantic Embedding Models](https://www.omarai.me/embeddings)

---

## Simple RAG and Other NLP Tasks

Example notebook: [RAG & NLP Tasks Notebook](https://colab.research.google.com/drive/1Tmk5XgbvsFOIDYH7M9TwsaQ40qwP4h_O?usp=sharing)

---

## Model Overview

**AraGemma-Embedding-300m** is a fine-tuned version of **[EmbeddingGemma-300M](https://ai.google.dev/gemma/docs/embeddinggemma)**, optimized for **Arabic semantic understanding**.

The model was fine-tuned on **1 million Arabic triplets** (anchor, positive, negative) with **Matryoshka Representation Learning (MRL)** to improve semantic similarity, clustering, classification, and retrieval for Arabic texts.

It builds on Google's Gemma 3 research, making it lightweight, efficient, and deployable on-device (mobile, laptops, desktops) while achieving **state-of-the-art Arabic semantic embedding performance**.

---

## Model Information

### Input

- Text string (Arabic or multilingual)
- Maximum context length: **2048 tokens**

### Output

- Dense vector representation of size **768**
- Supports **MRL truncation** to 512, 256, or 128 dimensions with re-normalization (see the sketch below)
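MRL truncation amounts to slicing off the leading coordinates of the 768-dimensional vector and re-normalizing the result. The snippet below is a minimal sketch of that procedure, not an official API; the Arabic sentence is one of the examples reused from the Usage section and the choice of 256 dimensions is arbitrary.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

# Full 768-dimensional embedding (illustrative sentence: "Mars is called the Red Planet ...")
embedding = model.encode("المريخ يسمى بالكوكب الأحمر بسبب لونه المميز.")

# MRL places the most informative coordinates first, so keep the leading `dim`
# values and re-normalize to unit length so cosine similarity stays meaningful.
dim = 256  # any of 512, 256, or 128
truncated = embedding[:dim]
truncated = truncated / np.linalg.norm(truncated)

print(truncated.shape)  # (256,)
```

Recent versions of Sentence Transformers also accept a `truncate_dim` argument on the `SentenceTransformer` constructor that performs the same truncation; the manual slice above simply makes the re-normalization step explicit.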
---

## Performance

### Benchmark Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/kEkJ41nyV3CfhLT00_lfP.png)

The results show significant improvements, reflecting stronger **Arabic semantic understanding**.

### Comparison with Other Arabic Embedding Models

| Model | Dim | # Params. | STS17 | STS22-v2 | Average |
|------------------------------------------|------|-----------|-------|----------|---------|
| Arabic-Triplet-Matryoshka-V2 | 768 | 135M | 85 | 64 | 75 |
| Arabert-all-nli-triplet-Matryoshka | 768 | 135M | 83 | 64 | 74 |
| **AraGemma-Embedding-300m** | 768 | 303M | 84 | 62 | 73 |
| GATE-AraBert-V1 | 767 | 135M | 83 | 63 | 73 |
| Marbert-all-nli-triplet-Matryoshka | 768 | 163M | 82 | 61 | 72 |
| Arabic-labse-Matryoshka | 768 | 471M | 82 | 61 | 72 |
| AraEuroBert-Small | 768 | 210M | 80 | 61 | 71 |
| E5-all-nli-triplet-Matryoshka | 384 | 278M | 80 | 60 | 70 |
| text-embedding-3-large | 3072 | - | 81 | 59 | 70 |
| Arabic-all-nli-triplet-Matryoshka | 768 | 135M | 82 | 54 | 68 |
| AraEuroBert-Mid | 1151 | 610M | 83 | 53 | 68 |
| paraphrase-multilingual-mpnet-base-v2 | 768 | 135M | 79 | 55 | 67 |
| AraEuroBert-Large | 2304 | 2.1B | 79 | 55 | 67 |
| text-embedding-ada-002 | 1536 | - | 71 | 62 | 66 |
| text-embedding-3-small | 1536 | - | 72 | 57 | 65 |

---

## Usage

This model is compatible with [Sentence Transformers](https://www.SBERT.net) and Hugging Face [Transformers](https://huggingface.co/docs/transformers/en/index).

```python
import torch
from sentence_transformers import SentenceTransformer

# Load the Arabic-optimized embedding model
model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

# Example: Arabic semantic similarity
query = "ما هو الكوكب الأحمر؟"
documents = [
    "الزهرة تشبه الأرض في الحجم والقرب.",
    "المريخ يسمى بالكوكب الأحمر بسبب لونه المميز.",
    "المشتري أكبر كواكب المجموعة الشمسية.",
    "زحل يتميز بحلقاته الشهيرة.",
]

# Encode the query and the candidate documents
query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)

# Compute cosine similarity between the query and each document
query_tensor = torch.tensor(query_embedding)
doc_tensors = torch.tensor(doc_embeddings)

similarities = torch.cosine_similarity(query_tensor.unsqueeze(0), doc_tensors)
print(similarities)
```

## Applications

- Semantic chunking for RAG (Retrieval-Augmented Generation)
- Semantic search and retrieval (Arabic focus)
- Clustering and classification of Arabic documents
- Cross-lingual retrieval (multilingual data supported)

## Limitations

- Embedding activations do not support float16 – use float32 or bfloat16.

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{AraGemmaEmbedding2025,
  title={AraGemma-Embedding: Fine-tuned EmbeddingGemma for Arabic Semantic Understanding},
  author={Omartificial-Intelligence-Space},
  year={2025},
  url={https://huggingface.co/Omartificial-Intelligence-Space/AraGemma-Embedding-300m}
}
```