Whisper Medium Arabic - Quran Fine-tuned (Full Fine-tuning)

Model Description

This model is a fine-tuned version of openai/whisper-medium specifically optimized for Arabic Quran recitation transcription.

The model was fine-tuned using full fine-tuning on a dataset of professional and non-professional Quran recitations from MP3Quran and Tarteel AI, making it highly effective for transcribing Quranic Arabic speech.

  • Developed by: Yousif H A
  • Model type: Automatic Speech Recognition (ASR)
  • Language: Arabic (ar)
  • License: Apache 2.0
  • Base model: openai/whisper-medium
  • Fine-tuning method: Full Fine-tuning

Training Details

Training Data

The dataset consists of Quran recitations by professional reciters from MP3Quran, preprocessed with:

  • Audio normalized to 16kHz mono
  • Text without diacritics (tashkeel removed)
  • Log-mel spectrograms extracted
  • Shuffled to ensure diverse train/val/test splits

Training Hyperparameters

Training Arguments (a configuration sketch follows this list):

  • Batch size per device: 32
  • Gradient accumulation steps: 1
  • Effective batch size: 32
  • Learning rate: 1e-06
  • Warmup steps: 500
  • Number of epochs: 0.01
  • Precision: bf16
  • Optimizer: AdamW (default)
  • Learning rate scheduler: linear with warmup
  • Max generation length: 256
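
The training script is not published with this card, but the hyperparameters above map roughly onto the following Seq2SeqTrainingArguments. This is an illustrative sketch: the output directory is a placeholder, and anything not listed above simply reflects Transformers defaults.

from transformers import Seq2SeqTrainingArguments

# Approximate reconstruction of the hyperparameters listed above;
# output_dir is a hypothetical path, not the one actually used.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-quran",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,         # effective batch size 32
    learning_rate=1e-6,
    warmup_steps=500,
    num_train_epochs=0.01,
    bf16=True,
    optim="adamw_torch",                   # default AdamW
    lr_scheduler_type="linear",            # linear decay with warmup
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=256,
)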

Generation Configuration (see the snippet after this list):

  • Task: Transcription
  • Language: Arabic (forced)
  • No repeat n-gram size: 3
  • Repetition penalty: 2.0
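
These settings should already ship in the repository's generation_config.json, so they apply automatically at inference time. If you want to set them explicitly (for example, when starting from the base model), the options below are standard Transformers/Whisper generation parameters:

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("yousifgamalo/quran-s-finetuned")

# Force Arabic transcription and apply the anti-repetition settings above
model.generation_config.language = "arabic"
model.generation_config.task = "transcribe"
model.generation_config.no_repeat_ngram_size = 3
model.generation_config.repetition_penalty = 2.0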

Training Infrastructure

  • Gradient checkpointing: Enabled
  • Mixed precision training: bf16
  • Early stopping: WER threshold 0.03 (see the callback sketch below)
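
Transformers' built-in EarlyStoppingCallback is patience-based rather than threshold-based, so stopping at a fixed WER most likely used a small custom callback along these lines. This is a sketch of the idea, not the actual training code, and the "eval_wer" key assumes a compute_metrics function that reports WER.

from transformers import TrainerCallback

class WerThresholdCallback(TrainerCallback):
    """Stop training once the evaluation WER drops below a target value."""

    def __init__(self, wer_threshold=0.03):
        self.wer_threshold = wer_threshold

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # Assumes compute_metrics returns a "wer" entry (logged as "eval_wer")
        wer = (metrics or {}).get("eval_wer")
        if wer is not None and wer < self.wer_threshold:
            control.should_training_stop = True
        return control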

Evaluation

Test Set Metrics

  • Word Error Rate (WER): 0.1162
  • Test Loss: 0.0317
  • Runtime: 1300.45 seconds
  • Samples per second: 9.81
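
The exact evaluation script is not included; the WER above is presumably the standard word error rate, which can be reproduced with the evaluate library as in this minimal sketch (the strings are toy examples):

import evaluate

wer_metric = evaluate.load("wer")

# Toy example; in practice, predictions come from model.generate and
# references are the diacritics-free ground-truth transcripts.
predictions = ["بسم الله الرحمن الرحيم"]
references = ["بسم الله الرحمن الرحيم"]

print(wer_metric.compute(predictions=predictions, references=references))  # 0.0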

Evaluation Data

The model was evaluated on a held-out test set of 12761 samples from the same distribution as the training data (professional Quran recitations from MP3Quran).

Use Limitations and License

  • Commercial use is not permitted; only non-profit use is allowed.

Installation

pip install transformers torch torchaudio librosa

Inference Example

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
model_id = "yousifgamalo/quran-s-finetuned"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load and preprocess audio from a file (resampled to 16 kHz mono)
audio_array, sampling_rate = librosa.load("quran_recitation.wav", sr=16000)

# Process audio
input_features = processor(
    audio_array, 
    sampling_rate=16000, 
    return_tensors="pt"
).input_features.to(device)

# Generate transcription
# The model is configured to output Arabic text automatically
predicted_ids = model.generate(input_features)

# Decode prediction
transcription = processor.batch_decode(
    predicted_ids, 
    skip_special_tokens=True
)[0]

print(f"Transcription: {transcription}")

Using with Pipeline

import torch
from transformers import pipeline

# Create ASR pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="yousifgamalo/quran-s-finetuned",
    device=0 if torch.cuda.is_available() else -1
)

# Transcribe audio
result = pipe("quran_recitation.wav")
print(result["text"])
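
Whisper operates on 30-second windows, so for full surah recordings it is usually easiest to let the pipeline chunk the audio. chunk_length_s is a standard pipeline argument rather than anything specific to this model, and the file name below is a placeholder:

import torch
from transformers import pipeline

# Chunked inference for recitations longer than 30 seconds
pipe = pipeline(
    "automatic-speech-recognition",
    model="yousifgamalo/quran-s-finetuned",
    chunk_length_s=30,
    device=0 if torch.cuda.is_available() else -1,
)

result = pipe("long_surah_recitation.wav")  # placeholder file name
print(result["text"])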

Limitations and Bias

  • Domain-specific: This model is optimized for Quran recitation and may not perform well on general Arabic speech
  • Professional recordings: Trained on professional reciters from MP3Quran; performance may vary on non-professional recordings
  • No diacritics: The model outputs Arabic text without diacritical marks (tashkeel)
  • Classical Arabic: Optimized for Classical/Quranic Arabic, not Modern Standard Arabic or dialects

Training Procedure Details

Preprocessing

  1. Audio files resampled to 16kHz mono
  2. Log-mel spectrograms extracted using Whisper's feature extractor
  3. Text normalized (Arabic diacritics removed)
  4. Dataset shuffled before splitting to ensure representative distributions
  5. Train/validation/test split: 98%/1%/1%
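
A minimal sketch of steps 1-3, assuming a local audio dataset: the dataset path, column names, and diacritics regex below are illustrative and not taken from the actual preprocessing script.

import re

from datasets import Audio, load_dataset
from transformers import WhisperProcessor

TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")  # Arabic diacritic marks

def remove_diacritics(text):
    # Strip tashkeel so the targets match the model's undiacritized output
    return TASHKEEL.sub("", text)

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")

# Placeholder dataset; clips are decoded as 16 kHz mono on access
dataset = load_dataset("audiofolder", data_dir="quran_recitations")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(example):
    audio = example["audio"]
    # Log-mel spectrogram via Whisper's feature extractor
    example["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Tokenized, diacritics-free transcript as labels
    example["labels"] = processor.tokenizer(
        remove_diacritics(example["transcription"])
    ).input_ids
    return example

dataset = dataset.map(prepare)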

Full Fine-tuning

This model was trained using full fine-tuning, where all model parameters are updated during training. This provides maximum flexibility but requires more memory and compute resources.
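
Because every parameter is updated (unlike adapter-based methods such as LoRA), the trainable parameter count equals the total parameter count, which you can verify with a quick check:

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("yousifgamalo/quran-s-finetuned")

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

# With full fine-tuning both counts match (~769M parameters for Whisper Medium)
print(f"total: {total / 1e6:.0f}M, trainable: {trainable / 1e6:.0f}M")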

Citation

If you use this model, please cite:

@misc{yousifgamalo2024whisperquran,
  author = {Yousif H A},
  title = {Whisper Medium - Quran Fine-tuned},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/yousifgamalo/quran-s-finetuned}}
}

Also cite the original Whisper paper:

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

Model Card Contact

For questions or issues, please open an issue in the model repository.
