---
language:
- ar
base_model:
- sayed0am/arabic-english-bge-m3
tags:
- sentence-similarity
- sentence-transformers
datasets:
- castorini/mr-tydi
- hsseinmz/arcd
- Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset
- arbml/Arabic_RC
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/662294730e805d4fcb06a892/n3whDLHDmEAhbFgYCbhRj.png)

# 🧠 Muffakir: Fine-tuned Arabic Model for RAG & Dense Retrieval

[Muffakir](https://huggingface.co/mohamed2811/Muffakir_Embedding_V2)

This is the second version of the [Muffakir_Embedding model](https://huggingface.co/mohamed2811/Muffakir_Embedding). It shows strong performance in **Arabic retrieval-augmented generation (RAG)** and dense retrieval tasks. We plan to release a series of models focused on different topics and domains to further enhance Arabic information retrieval. 🚀

---

## 🔍 Model Overview

* 🧬 **Base model**: [`sayed0am/arabic-english-bge-m3`](https://huggingface.co/sayed0am/arabic-english-bge-m3)
* 📚 **Fine-tuning dataset**: \~70,000 Arabic sentence pairs from various topics
  * 🏫 **20K** curated from Egyptian legal books
  * 🌐 **50K** collected from Hugging Face datasets (multi-domain)
* 🏋️ **Training epochs**: 3
* 📐 **Embedding dimension**: 1024
* 🔗 **Loss functions**:
  * [`MultipleNegativesRankingLoss`](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss)
  * [`MatryoshkaLoss`](https://huggingface.co/blog/matryoshka-representations) for multi-resolution embeddings

---

## 🌟 Key Features

* 🥇 **Strong performance** in **Arabic RAG** and dense retrieval tasks
* 🎯 **Multi-resolution embeddings** via Matryoshka (dims: `1024 → 64`)
* 🌍 Supports **Arabic** text encoding
* 📦 Ready for use in real-world search, Q\&A, and AI agent systems

---

## ⚙️ Training Details

* 🧾 **Dataset size**: 70K examples
* 🗂️ **Topics**: Multi-domain (educational, legal, general knowledge, etc.)
* 🔁 **Epochs**: 3
* 🧪 **Batch size**: 8 (gradient accumulation enabled)
* 🚀 **Learning rate**: 2e-5
* 🧰 **Framework**: [sentence-transformers](https://www.sbert.net)

---

## 📀 Model Specs

* 🔢 Embedding size: `1024`
* 🔄 Supports Matryoshka-style dimension truncation
* 🧠 Bi-encoder setup, ideal for fast and scalable retrieval tasks

---

## 🏆 Leaderboard Performance

* The **Muffakir\_Embedding\_V2** model placed **5th** in the **Retrieval** category of the [Arabic RAG Leaderboard](https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard).
* This result underscores the model's effectiveness at retrieving and ranking relevant information within Arabic retrieval-augmented generation (RAG) systems.

---

## 🧪 Example Usage

```python
from sentence_transformers import SentenceTransformer
import torch

# Load the fine-tuned Muffakir model
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Example query and candidate passages
# Query: "What are the conditions for a valid contract?"
query = "ما هي شروط صحة العقد؟"
passages = [
    # "Mutual consent is required for a contract to be valid."
    "يشترط التراضي لصحة العقد.",
    # "Law is divided into public and private."
    "ينقسم القانون إلى عام وخاص.",
    # "The contract is the law of the contracting parties."
    "العقد شريعة المتعاقدين.",
    # "Legal guardianship ends upon reaching the age of majority."
    "تنتهي الولاية القانونية ببلوغ سن الرشد.",
]

# Encode query and passages (embeddings are L2-normalized)
embedding_query = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
embedding_passages = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# Compute cosine similarities (dot product of normalized vectors)
cosine_scores = torch.matmul(embedding_query, embedding_passages.T)

# Get best matching passage
best_idx = cosine_scores.argmax().item()
best_passage = passages[best_idx]

print(f"🔍 Best matching passage: {best_passage}")
```

```bibtex
@misc{muffakir2025,
  author = {Mohamed Khaled},
  title = {Muffakir: State-of-the-art Arabic-English Bi-Encoder for Dense Retrieval},
  year = {2025},
  howpublished = {\url{https://huggingface.co/your-username/Muffakir-embeddings-v2}},
}
```

---
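The Matryoshka-style dimension truncation listed under Model Specs can be sketched as follows. This is a minimal illustration, not the author's pipeline: it uses random NumPy vectors as a stand-in for `model.encode(...)` output so it runs without downloading the model. The helper `truncate_and_normalize` and the target size of 256 are hypothetical choices; the card only states the range `1024 → 64`, not which intermediate sizes were trained.

```python
import numpy as np


def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row, then rescale rows to unit length."""
    if not 0 < dim <= embeddings.shape[1]:
        raise ValueError(f"dim must be in 1..{embeddings.shape[1]}, got {dim}")
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against all-zero rows so the division is always defined.
    return truncated / np.clip(norms, 1e-12, None)


# Stand-in for model.encode(...) output: 1 query and 4 passages, 1024 dims each.
# Random vectors are used only so the sketch runs offline.
rng = np.random.default_rng(0)
query_full = rng.standard_normal((1, 1024)).astype(np.float32)
passages_full = rng.standard_normal((4, 1024)).astype(np.float32)

# Hypothetical target size; 256 is illustrative, not a documented setting.
target_dim = 256
query_small = truncate_and_normalize(query_full, target_dim)
passages_small = truncate_and_normalize(passages_full, target_dim)

# On unit-length vectors the dot product equals cosine similarity.
scores = query_small @ passages_small.T
best_idx = int(scores.argmax())
print(scores.shape, best_idx)
```

Re-normalizing after truncation matters because the first 256 components of a unit-length 1024-dimensional vector are generally no longer unit length. With real embeddings, recent versions of sentence-transformers also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may be more convenient than slicing by hand.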