NLLB-Darija-FR/ENG - Fine-Tuned Translation Model
This repository contains a specialized translation model for Darija (Moroccan Arabic), French, and English, based on the facebook/nllb-200-distilled-600M model.
The model has been fine-tuned using the LoRA (Low-Rank Adaptation) technique to enhance its translation capabilities for these specific language pairs, which are often underrepresented in generalist models.
The model can translate in both directions (e.g., French to Darija and Darija to French).
This project was developed following a full MLOps approach, including an automated training and deployment pipeline.
Usage with transformers

You can use this model directly with a pipeline from the transformers library.
Installation
Make sure you have the necessary libraries installed:
```
pip install torch transformers sentencepiece
```
Example Python Code
```python
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
# 1. Load the model and tokenizer from the Hub
model_id = "Farid59/nllb-darija-fr_eng"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
# 2. Create the translation pipeline
translator = pipeline("translation", model=model, tokenizer=tokenizer)
# --- Example 1: French to Darija ---
texte_fr = "Bonjour, je voudrais rรฉserver une table pour deux personnes ce soir."
traduction_darija = translator(
texte_fr,
src_lang="fra_Latn",
tgt_lang="ary_Arab"
)
print(f"French -> Darija:")
print(f" Input: {texte_fr}")
print(f" Output: {traduction_darija[0]['translation_text']}")
# Expected output: أهلا, بغيت نحجز طاولا لشخصين هاد الليلا
# --- Example 2: Darija to English ---
texte_darija = "ุดุญุงู ูุงูููู ูุงุฏุดู"
traduction_anglais = translator(
texte_darija,
src_lang="ary_Arab",
tgt_lang="eng_Latn"
)
print(f"\nDarija -> English:")
print(f" Input: {texte_darija}")
print(f" Output: {traduction_anglais[0]['translation_text']}")
# Expected output: How much does that cost?
```
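If you prefer to call the model without the pipeline wrapper, the standard NLLB convention is to set the tokenizer's src_lang and force the target language code as the first generated token. A minimal sketch (the example sentence is illustrative):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Farid59/nllb-darija-fr_eng"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Encode the English source sentence
inputs = tokenizer("Where is the train station?", return_tensors="pt")

# NLLB selects the output language by forcing the first generated token
# to be the target language code
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("ary_Arab"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```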
Model Details
- Base model: facebook/nllb-200-distilled-600M
- Fine-tuning technique: LoRA (Low-Rank Adaptation)
- Supported languages:
  - fra_Latn (French)
  - eng_Latn (English)
  - ary_Arab (Darija, Arabic script)
Training Data
The model was trained on a composite corpus assembled from several sources:
- The Darija-SFT-Mixture dataset
- Data collected by scraping specialized websites
- Synthetic data generated to cover tourist and conversational scenarios
All data was cleaned, deduplicated, and formatted for bidirectional training.
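"Bidirectional" here means each aligned pair yields one training example per translation direction. A hypothetical sketch of that preparation step, with illustrative field names (not taken from the actual pipeline):

```python
def make_bidirectional(pairs):
    """Expand aligned (Darija, French) pairs into one training
    example per translation direction, tagged with NLLB codes."""
    examples = []
    for ary, fra in pairs:
        examples.append({"src_lang": "ary_Arab", "tgt_lang": "fra_Latn",
                         "src": ary, "tgt": fra})
        examples.append({"src_lang": "fra_Latn", "tgt_lang": "ary_Arab",
                         "src": fra, "tgt": ary})
    # Deduplicate so repeated scraped pairs appear only once
    seen, unique = set(), []
    for ex in examples:
        key = (ex["src_lang"], ex["src"], ex["tgt"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```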
⚙️ Training Process
Training was orchestrated by an automated MLOps pipeline using GitHub Actions. The process includes:
- Data collection and preparation
- Fine-tuning with LoRA on a GPU runner (see the sketch after this list)
- Merging LoRA adapter weights into the base model to create a standalone model
- Evaluation of the merged model on a dedicated test set using the SacreBLEU metric
- Validation Gate: The new model is deployed only if its BLEU score exceeds that of the production version
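A minimal sketch of what the LoRA fine-tuning and merge steps could look like with the peft library; the rank, scaling, and target modules shown are illustrative assumptions, not the repository's actual settings:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

# Hypothetical LoRA configuration (values chosen for illustration)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# ... fine-tune with Seq2SeqTrainer or a custom training loop ...

# Merge the adapter weights back into the base model so the result is a
# standalone checkpoint that needs no peft dependency at inference time
merged = model.merge_and_unload()
merged.save_pretrained("nllb-darija-fr_eng-merged")
```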
Performance
Fine-tuning resulted in a substantial improvement on the dedicated test set.
| Model | BLEU score (on test set) |
|---|---|
| facebook/nllb-200-distilled-600M (base model) | 8.19 |
| Farid59/nllb-darija-fr_eng (this fine-tuned model) | 18.9 |
This gain of more than 10 BLEU points demonstrates the effectiveness of fine-tuning in specializing the model for the nuances of Darija.
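For reference, a corpus BLEU comparison of this kind can be computed with the sacrebleu package. This sketch assumes lists of hypothesis and reference strings plus a stored production score; the function name is hypothetical:

```python
import sacrebleu

def passes_gate(hypotheses, references, production_bleu):
    """Return True if the candidate model's corpus BLEU on the test
    set beats the production model's score (the deployment gate)."""
    score = sacrebleu.corpus_bleu(hypotheses, [references]).score
    print(f"Candidate BLEU: {score:.2f} (production: {production_bleu:.2f})")
    return score > production_bleu
```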
Author
Farid Igouti
This project is part of a portfolio showcasing skills in MLOps, CI/CD, and AI model deployment.