---
language:
- ar
license: apache-2.0
base_model: openai/whisper-medium
tags:
- whisper
- arabic
- quran
- speech-recognition
- automatic-speech-recognition
datasets:
- yousifgamalo/mp3quran
metrics:
- wer
model-index:
- name: quran-s-finetuned
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: yousifgamalo/mp3quran
      name: MP3Quran Professional Recitations
    metrics:
    - type: wer
      value: 0.1162
      name: Word Error Rate
---

# Whisper Medium Arabic - Quran Fine-tuned (Full Fine-tuning)

## Model Description

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) optimized for **Arabic Quran recitation transcription**. It was trained with **full fine-tuning** on a dataset of professional and non-professional Quran recitations from MP3Quran and Tarteel AI, making it well suited to transcribing Quranic Arabic speech.

- **Developed by:** Yousif H A
- **Model type:** Automatic Speech Recognition (ASR)
- **Language:** Arabic (ar)
- **License:** Apache 2.0
- **Base model:** openai/whisper-medium
- **Fine-tuning method:** Full fine-tuning

## Training Details

### Training Data

- **Source dataset:** [yousifgamalo/mp3quran](https://huggingface.co/datasets/yousifgamalo/mp3quran)
- **Processed dataset:** [yousifgamalo/quran-cleaned-nonprofessional](https://huggingface.co/datasets/yousifgamalo/quran-cleaned-nonprofessional)
- **Training samples:** 1,250,527
- **Validation samples:** 12,760
- **Test samples:** 12,761
- **Total samples:** 1,276,048

The dataset consists of Quran recitations by professional reciters from MP3Quran, preprocessed as follows:

- Audio normalized to 16 kHz mono
- Text stripped of diacritics (tashkeel removed)
- Log-mel spectrograms extracted
- Shuffled to ensure diverse train/validation/test splits

### Training Hyperparameters

**Training arguments:**

- Batch size per device: 32
- Gradient accumulation steps: 1
- Effective batch size: 32
- Learning rate: 1e-06
- Warmup steps: 500
- Number of epochs: 0.01
- Precision: bf16
- Optimizer: AdamW (default)
- Learning rate scheduler: linear with warmup
- Max generation length: 256

**Generation configuration:**

- Task: transcription
- Language: Arabic (forced)
- No-repeat n-gram size: 3
- Repetition penalty: 2.0

A hedged sketch mapping these settings onto `Seq2SeqTrainingArguments` and `GenerationConfig` is given just before the usage examples below.

### Training Infrastructure

- **Gradient checkpointing:** enabled
- **Mixed precision training:** bf16
- **Early stopping:** WER threshold 0.03

## Evaluation

### Test Set Metrics

| Metric | Value |
|--------|-------|
| **Word Error Rate (WER)** | **0.1162** |
| Test loss | 0.0317 |
| Runtime (seconds) | 1300.45 |
| Samples per second | 9.81 |

### Evaluation Data

The model was evaluated on a held-out test set of 12,761 samples drawn from the same distribution as the training data (professional Quran recitations from MP3Quran).

## Use Limitations and License

- Commercial use is not permitted; only nonprofit use is allowed.
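For reference, the following is a minimal sketch of how the training and generation settings listed above could be expressed with Hugging Face `Seq2SeqTrainingArguments` and `GenerationConfig`. It is an illustration derived from the values in this card, not the original training script; `output_dir` is a placeholder, and the `generation_config.json` shipped with the model remains authoritative.

```python
from transformers import Seq2SeqTrainingArguments, GenerationConfig

# Sketch only: maps the hyperparameters listed in this card onto
# Seq2SeqTrainingArguments. output_dir and any omitted settings
# (evaluation/save cadence, early stopping) are placeholders.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-quran",  # placeholder path
    per_device_train_batch_size=32,       # batch size per device
    gradient_accumulation_steps=1,        # effective batch size = 32
    learning_rate=1e-6,
    warmup_steps=500,
    num_train_epochs=0.01,
    bf16=True,                            # mixed-precision training
    gradient_checkpointing=True,
    lr_scheduler_type="linear",           # linear schedule with warmup
    predict_with_generate=True,
    generation_max_length=256,            # max generation length
)

# Generation settings listed under "Generation configuration".
# Language and task are forced to Arabic transcription via Whisper's
# decoder prompt, which is stored with the model and not shown here.
generation_config = GenerationConfig(
    max_length=256,
    no_repeat_ngram_size=3,
    repetition_penalty=2.0,
)
```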
## Usage

### Installation

```bash
pip install transformers torch torchaudio
```

### Inference Example

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
model_id = "yousifgamalo/quran-s-finetuned"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load audio from a file and resample to 16 kHz
audio_array, sampling_rate = librosa.load("quran_recitation.wav", sr=16000)

# Extract log-mel input features
input_features = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt"
).input_features.to(device)

# Generate transcription
# The model is configured to output Arabic text automatically
predicted_ids = model.generate(input_features)

# Decode prediction
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

print(f"Transcription: {transcription}")
```

### Using with Pipeline

```python
import torch
from transformers import pipeline

# Create ASR pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="yousifgamalo/quran-s-finetuned",
    device=0 if torch.cuda.is_available() else -1,
)

# Transcribe audio
result = pipe("quran_recitation.wav")
print(result["text"])
```

## Limitations and Bias

- **Domain-specific:** Optimized for **Quran recitation**; it may not perform well on general Arabic speech.
- **Professional recordings:** Trained on professional reciters from MP3Quran, so performance may vary on non-professional recordings.
- **No diacritics:** Outputs Arabic text **without diacritical marks** (tashkeel).
- **Classical Arabic:** Optimized for Classical/Quranic Arabic, not Modern Standard Arabic or dialects.

## Training Procedure Details

### Preprocessing

1. Audio files resampled to 16 kHz mono
2. Log-mel spectrograms extracted using Whisper's feature extractor
3. Text normalized (Arabic diacritics removed)
4. Dataset shuffled before splitting to ensure representative distributions
5. Train/validation/test split: 98%/1%/1%

A hedged sketch of these steps is given in the appendix at the end of this card.

### Full Fine-tuning

This model was trained with **full fine-tuning**, in which all model parameters are updated during training. This provides maximum flexibility but requires more memory and compute than parameter-efficient approaches.

## Citation

If you use this model, please cite:

```bibtex
@misc{quran_s_finetuned,
  author       = {Yousif H A},
  title        = {Whisper Medium - Quran Fine-tuned},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/yousifgamalo/quran-s-finetuned}}
}
```

Also cite the original Whisper paper:

```bibtex
@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}
```

## Model Card Contact

For questions or issues, please open an issue in the model repository.
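## Appendix: Preprocessing Sketch

The following is a minimal, hedged sketch of the preprocessing steps described under Training Procedure Details. It is an assumed implementation for illustration, not the exact pipeline used to build the dataset; the example keys (`audio_path`, `text`), the diacritics regex, and the helper names are assumptions.

```python
import re

import librosa
from transformers import WhisperProcessor

# Sketch only: not the exact pipeline used to build the training data.
processor = WhisperProcessor.from_pretrained("openai/whisper-medium")

# Arabic diacritics (tashkeel) and superscript alef to strip from transcripts
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")


def remove_tashkeel(text: str) -> str:
    """Remove Arabic diacritical marks from a transcript."""
    return TASHKEEL.sub("", text)


def preprocess(example: dict) -> dict:
    """Turn one raw example into Whisper input features and label ids."""
    # 1) Resample the audio to 16 kHz mono
    audio, _ = librosa.load(example["audio_path"], sr=16000, mono=True)

    # 2) Extract the log-mel spectrogram with Whisper's feature extractor
    example["input_features"] = processor.feature_extractor(
        audio, sampling_rate=16000
    ).input_features[0]

    # 3) Normalize the text (strip tashkeel) and tokenize it as labels
    example["labels"] = processor.tokenizer(
        remove_tashkeel(example["text"])
    ).input_ids
    return example

# Shuffling and the 98%/1%/1% split would then be applied to the mapped
# dataset, e.g. with the datasets library's shuffle() and train_test_split().
```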