NLLB-Darija-FR/ENG - Fine-Tuned Translation Model

This repository contains a specialized translation model for Darija (Moroccan Arabic), French, and English, based on the facebook/nllb-200-distilled-600M model.

The model has been fine-tuned with LoRA (Low-Rank Adaptation) to improve translation quality for these language pairs, which are often underrepresented in general-purpose models.
It translates in both directions (e.g., French to Darija and Darija to French).

This project was developed following a full MLOps approach, including an automated training and deployment pipeline.

🚀 Usage with transformers

You can use this model directly with a pipeline from the transformers library.

Installation

Make sure you have the necessary libraries installed:

pip install torch transformers sentencepiece

Example Python Code

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the model and tokenizer from the Hub
model_id = "Farid59/nllb-darija-fr_eng"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Create the translation pipeline
translator = pipeline("translation", model=model, tokenizer=tokenizer)

# --- Example 1: French to Darija ---
texte_fr = "Bonjour, je voudrais réserver une table pour deux personnes ce soir."
traduction_darija = translator(
    texte_fr,
    src_lang="fra_Latn",
    tgt_lang="ary_Arab"
)
print(f"French -> Darija:")
print(f"  Input: {texte_fr}")
print(f"  Output: {traduction_darija[0]['translation_text']}")
# Expected output: ุฃู‡ู„ุง, ุจุบูŠุช ู†ุญุฌุฒ ุทุงูˆู„ุง ู„ุดุฎุตูŠู† ู‡ุงุฏ ุงู„ู„ูŠู„ุง

# --- Example 2: Darija to English ---
texte_darija = "شحال كايكلف هادشي"
traduction_anglais = translator(
    texte_darija,
    src_lang="ary_Arab",
    tgt_lang="eng_Latn"
)
print(f"\nDarija -> English:")
print(f"  Input: {texte_darija}")
print(f"  Output: {traduction_anglais[0]['translation_text']}")
# Expected output: How much does that cost?
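
The pipeline wraps the standard NLLB generation pattern. If you need lower-level control, a minimal sketch of the same translation with generate() (continuing from the model and tokenizer loaded above) looks like this:

# Equivalent translation without the pipeline.
tokenizer.src_lang = "fra_Latn"  # the source language is set on the tokenizer
inputs = tokenizer(texte_fr, return_tensors="pt")
outputs = model.generate(
    **inputs,
    # Force the target language code as the first generated token
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("ary_Arab"),
    max_new_tokens=100,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])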

📜 Model Details

  • Base model: facebook/nllb-200-distilled-600M
  • Fine-tuning technique: LoRA (Low-Rank Adaptation)
  • Model size: 615M parameters (safetensors, BF16)
  • Supported languages:
    • fra_Latn (French)
    • eng_Latn (English)
    • ary_Arab (Darija, Arabic script)

📊 Training Data

The model was trained on a composite corpus assembled from several sources:

  • The Darija-SFT-Mixture dataset
  • Data collected by scraping specialized websites
  • Synthetic data generated to cover tourist and conversational scenarios

All data was cleaned, deduplicated, and formatted for bidirectional training.
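
Concretely, bidirectional formatting means each aligned pair yields one training example per direction. A minimal sketch (the field names are assumptions for illustration, not the actual preprocessing schema):

def make_bidirectional(pairs):
    """Turn (french, darija) sentence pairs into one example per direction."""
    examples = []
    for fr, ary in pairs:
        # Illustrative fields: FLORES-200 language codes plus source/target text
        examples.append({"src_lang": "fra_Latn", "tgt_lang": "ary_Arab", "src": fr, "tgt": ary})
        examples.append({"src_lang": "ary_Arab", "tgt_lang": "fra_Latn", "src": ary, "tgt": fr})
    return examples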

โš™๏ธ Training Process

Training was orchestrated by an automated MLOps pipeline using GitHub Actions. The process includes:

  1. Data collection and preparation
  2. Fine-tuning with LoRA on a GPU runner
  3. Merging the LoRA adapter weights into the base model to create a standalone model (steps 2 and 3 are sketched after this list)
  4. Evaluation of the merged model on a dedicated test set using the SacreBLEU metric
  5. Validation Gate: The new model is deployed only if its BLEU score exceeds that of the production version
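
For reference, steps 2 and 3 typically look like the following with the peft library. This is a minimal sketch: the rank, alpha, target modules, and output path are illustrative assumptions, not the project's actual hyperparameters.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# Step 2 (sketch): attach LoRA adapters to the base model.
base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
lora_config = LoraConfig(
    r=16,                                 # illustrative rank
    lora_alpha=32,                        # illustrative scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumption)
    task_type="SEQ_2_SEQ_LM",
)
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()   # only the adapter weights are trainable

# ... training loop / Trainer runs here ...

# Step 3 (sketch): merge the adapters into the base weights for a standalone model.
merged = peft_model.merge_and_unload()
merged.save_pretrained("nllb-darija-merged")  # hypothetical output path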

Performance

Fine-tuning yielded a substantial performance improvement on the dedicated test set.

Model                                                 BLEU score (test set)
facebook/nllb-200-distilled-600M (base model)         8.19
Farid59/nllb-darija-fr_eng (this fine-tuned model)    18.9

This gain of more than 10 BLEU points demonstrates the effectiveness of fine-tuning in specializing the model for the nuances of Darija.
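
Scoring and the validation gate (steps 4 and 5 of the pipeline) can be sketched with sacrebleu; the hypotheses, references, and production score below are placeholders, not the actual evaluation data:

import sacrebleu

# Placeholder data: in the pipeline, hypotheses come from the merged model
# and references from the held-out test set.
hypotheses = ["How much does that cost?"]
references = [["How much does this cost?"]]  # one reference stream, parallel to hypotheses

new_score = sacrebleu.corpus_bleu(hypotheses, references).score
production_score = 8.19  # BLEU of the currently deployed model (example value)

# Validation gate: deploy only if the new model beats production.
if new_score > production_score:
    print(f"Deploying new model: BLEU {new_score:.2f} > {production_score:.2f}")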

Author

Farid Igouti

This project is part of a portfolio showcasing skills in MLOps, CI/CD, and AI model deployment.
