NLLB-Darija-FR/ENG - Fine-Tuned Translation Model

This repository contains a specialized translation model for Darija (Moroccan Arabic), French, and English, based on the facebook/nllb-200-distilled-600M model.

The model has been fine-tuned with LoRA (Low-Rank Adaptation) to improve translation quality for these language pairs, which are often underrepresented in general-purpose models.
It translates in both directions (e.g., French to Darija and Darija to French).

This project was developed following a full MLOps approach, including an automated training and deployment pipeline.

🚀 Usage with transformers

You can use this model directly with a pipeline from the transformers library.

Installation

Make sure you have the necessary libraries installed:

pip install torch transformers sentencepiece

Example Python Code

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the model and tokenizer from the Hub
model_id = "Farid59/nllb-darija-fr_eng"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Create the translation pipeline
translator = pipeline("translation", model=model, tokenizer=tokenizer)

# --- Example 1: French to Darija ---
texte_fr = "Bonjour, je voudrais réserver une table pour deux personnes ce soir."
traduction_darija = translator(
    texte_fr,
    src_lang="fra_Latn",
    tgt_lang="ary_Arab"
)
print(f"French -> Darija:")
print(f"  Input: {texte_fr}")
print(f"  Output: {traduction_darija[0]['translation_text']}")
# Expected output: ุฃู‡ู„ุง, ุจุบูŠุช ู†ุญุฌุฒ ุทุงูˆู„ุง ู„ุดุฎุตูŠู† ู‡ุงุฏ ุงู„ู„ูŠู„ุง

# --- Example 2: Darija to English ---
texte_darija = "شحال كايكلف هادشي"
traduction_anglais = translator(
    texte_darija,
    src_lang="ary_Arab",
    tgt_lang="eng_Latn"
)
print(f"\nDarija -> English:")
print(f"  Input: {texte_darija}")
print(f"  Output: {traduction_anglais[0]['translation_text']}")
# Expected output: How much does that cost?
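
The pipeline wraps the standard NLLB generation pattern. If you need lower-level control, a minimal sketch of the same translation with generate() (continuing from the model and tokenizer loaded above) looks like this:

# Equivalent translation without the pipeline.
tokenizer.src_lang = "fra_Latn"  # the source language is set on the tokenizer
inputs = tokenizer(texte_fr, return_tensors="pt")
outputs = model.generate(
    **inputs,
    # Force the target language code as the first generated token
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("ary_Arab"),
    max_new_tokens=100,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])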

📜 Model Details

  • Base model: facebook/nllb-200-distilled-600M
  • Fine-tuning technique: LoRA (Low-Rank Adaptation)
  • Model size: 615M parameters (safetensors, BF16)
  • Supported languages:
    • fra_Latn (French)
    • eng_Latn (English)
    • ary_Arab (Darija, Arabic script)

📊 Training Data

The model was trained on a composite corpus assembled from several sources:

  • The Darija-SFT-Mixture dataset
  • Data collected by scraping specialized websites
  • Synthetic data generated to cover tourist and conversational scenarios

All data was cleaned, deduplicated, and formatted for bidirectional training.
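
Concretely, bidirectional formatting means each aligned pair yields one training example per direction. A minimal sketch (the field names are assumptions for illustration, not the actual preprocessing schema):

def make_bidirectional(pairs):
    """Turn (french, darija) sentence pairs into one example per direction."""
    examples = []
    for fr, ary in pairs:
        # Illustrative fields: FLORES-200 language codes plus source/target text
        examples.append({"src_lang": "fra_Latn", "tgt_lang": "ary_Arab", "src": fr, "tgt": ary})
        examples.append({"src_lang": "ary_Arab", "tgt_lang": "fra_Latn", "src": ary, "tgt": fr})
    return examples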

โš™๏ธ Training Process

Training was orchestrated by an automated MLOps pipeline using GitHub Actions. The process includes:

  1. Data collection and preparation
  2. Fine-tuning with LoRA on a GPU runner
  3. Merging the LoRA adapter weights into the base model to create a standalone model (steps 2 and 3 are sketched after this list)
  4. Evaluation of the merged model on a dedicated test set using the SacreBLEU metric
  5. Validation Gate: The new model is deployed only if its BLEU score exceeds that of the production version
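
For reference, steps 2 and 3 typically look like the following with the peft library. This is a minimal sketch: the rank, alpha, target modules, and output path are illustrative assumptions, not the project's actual hyperparameters.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# Step 2 (sketch): attach LoRA adapters to the base model.
base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
lora_config = LoraConfig(
    r=16,                                 # illustrative rank
    lora_alpha=32,                        # illustrative scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumption)
    task_type="SEQ_2_SEQ_LM",
)
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()   # only the adapter weights are trainable

# ... training loop / Trainer runs here ...

# Step 3 (sketch): merge the adapters into the base weights for a standalone model.
merged = peft_model.merge_and_unload()
merged.save_pretrained("nllb-darija-merged")  # hypothetical output path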

Performance

Fine-tuning yielded a substantial performance improvement on the dedicated test set.

Model                                                 BLEU score (test set)
facebook/nllb-200-distilled-600M (base model)         8.19
Farid59/nllb-darija-fr_eng (this fine-tuned model)    18.9

This gain of more than 10 BLEU points demonstrates the effectiveness of fine-tuning in specializing the model for the nuances of Darija.
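
Scoring and the validation gate (steps 4 and 5 of the pipeline) can be sketched with sacrebleu; the hypotheses, references, and production score below are placeholders, not the actual evaluation data:

import sacrebleu

# Placeholder data: in the pipeline, hypotheses come from the merged model
# and references from the held-out test set.
hypotheses = ["How much does that cost?"]
references = [["How much does this cost?"]]  # one reference stream, parallel to hypotheses

new_score = sacrebleu.corpus_bleu(hypotheses, references).score
production_score = 8.19  # BLEU of the currently deployed model (example value)

# Validation gate: deploy only if the new model beats production.
if new_score > production_score:
    print(f"Deploying new model: BLEU {new_score:.2f} > {production_score:.2f}")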

Author

Farid Igouti

This project is part of a portfolio showcasing skills in MLOps, CI/CD, and AI model deployment.
