# pidgin-wav2vec2-xlsr53
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the Nigerian Pidgin dataset. It achieves the following results on the evaluation set:
- Loss: 0.6907
- Wer: 0.3161 (val)
## Model description
to be updated
## Intended uses & limitations
Intended Uses:
- Best suited for automatic speech recognition (ASR) tasks on Nigerian Pidgin audio, such as speech-to-text conversion and related downstream tasks.
- Academic research on low-resource and creole language ASR.
Known Limitations:
- Performance may degrade with dialectal variation, heavy code-switching, or noisy audio environments.
- Model reflects biases present in the training dataset, which may affect accuracy on underrepresented demographics, phonetic variations or topics.
- May struggle with rare words, numerals, and domain-specific terminology not well represented in the training set.
- Not recommended for high-stakes domains (e.g., legal, medical) without domain-specific retraining/finetuning.
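As a quick usage sketch, the model can be loaded through the `transformers` pipeline API for speech-to-text. The file name and audio details below are illustrative assumptions, not part of the released assets:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an ASR pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="asr-nigerian-pidgin/pidgin-wav2vec2-xlsr53",
)

# "sample_pidgin.wav" is a hypothetical recording; the pipeline decodes the
# file and resamples it to the 16 kHz rate the model expects.
result = asr("sample_pidgin.wav")
print(result["text"])
```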
## Training and evaluation data
The model was fine-tuned on the Nigerian Pidgin ASR v1.0 dataset, consisting of over 4,200 utterances recorded by 10 native speakers (balanced across gender and age) using the LIG-Aikuma mobile platform. Recordings were collected in controlled environments to ensure high-quality audio. Performance: WER 7.4% (train), 31.6% (validation), and 29.6% (test), outperforming baselines such as QuartzNet and zero-shot XLSR. These results demonstrate the effectiveness of targeted fine-tuning for low-resource ASR.
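The WER figures above measure the word-level edit distance between predicted and reference transcripts, normalised by the number of reference words. A minimal illustration using the `jiwer` package (an assumption; any WER implementation gives the same result, and the sentences are made up for the example):

```python
import jiwer

# Hypothetical reference/hypothesis pair:
# WER = (substitutions + deletions + insertions) / reference word count.
reference = "how you dey today"
hypothesis = "how you they today"

print(jiwer.wer(reference, hypothesis))  # 0.25 -> one substitution out of four words
```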
## Training procedure
We fine-tuned the facebook/wav2vec2-large-xlsr-53 model using the Nigerian Pidgin ASR dataset, following the methodology outlined in the XLSR-53 paper. Training was performed on a single NVIDIA A100 GPU using the Hugging Face transformers library with fp16 mixed precision to accelerate computation and reduce memory usage.
A key modification from the standard setup was unfreezing the feature encoder during fine-tuning. This adjustment yielded improved performance, lowering word error rates (WER) on both validation and test sets compared to the frozen-encoder approach.
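A minimal sketch of what this looks like with the `transformers` API (the CTC config value shown is illustrative, and the vocabulary would come from the Pidgin tokenizer, which is omitted here):

```python
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    # vocab_size / pad_token_id would be taken from the Pidgin tokenizer (omitted).
)

# Frozen-encoder baseline: stop gradients through the CNN feature encoder.
# model.freeze_feature_encoder()

# Unfrozen setup used here: simply skip the freezing call, so the convolutional
# feature encoder is updated together with the transformer layers and CTC head.
```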
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-4
- train_batch_size: 4
- eval_batch_size: 4
- seed: 3407
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 30
- mixed_precision_training: Native AMP
This configuration balanced training stability, efficiency, and accuracy, allowing the model to adapt effectively to Nigerian Pidgin speech patterns despite the dataset’s limited size; a sketch of the corresponding training arguments follows below.
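Expressed as Hugging Face `TrainingArguments`, the hyperparameters above could be written as in the following sketch. The output path and the evaluation/saving/logging cadence are illustrative assumptions; the remaining values are taken from the list:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="pidgin-wav2vec2-xlsr53",   # illustrative path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,         # effective train batch size 4 * 2 = 8
    num_train_epochs=30,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    seed=3407,
    fp16=True,                             # native AMP mixed precision
    # Adam betas/epsilon are left at the Trainer defaults (0.9, 0.999) / 1e-8.
    evaluation_strategy="steps",           # assumed cadence, matching the results table
    eval_steps=500,
    save_steps=500,
    logging_steps=500,
)
```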
### Performance comparison: frozen vs. unfrozen encoder
Encoder State | Val WER | Test WER |
---|---|---|
Frozen | 0.332 | 0.436 |
Unfrozen | 0.3161 | 0.296 |
### Training results (unfrozen model)
Training Loss | Epoch | Step | Validation Loss | Wer |
---|---|---|---|---|
6.604 | 1.48 | 500 | 3.0540 | 1.0 |
3.0176 | 2.95 | 1000 | 3.0035 | 1.0 |
2.1071 | 4.43 | 1500 | 1.0811 | 0.6289 |
1.1143 | 5.91 | 2000 | 0.8348 | 0.5017 |
0.8501 | 7.39 | 2500 | 0.7707 | 0.4352 |
0.7272 | 8.86 | 3000 | 0.7410 | 0.4075 |
0.6038 | 10.34 | 3500 | 0.6283 | 0.3850 |
0.5334 | 11.82 | 4000 | 0.6356 | 0.3701 |
0.4645 | 13.29 | 4500 | 0.6243 | 0.3657 |
0.4251 | 14.77 | 5000 | 0.6838 | 0.3492 |
0.3801 | 16.25 | 5500 | 0.6619 | 0.3445 |
0.3636 | 17.73 | 6000 | 0.6945 | 0.3360 |
0.3366 | 19.2 | 6500 | 0.6108 | 0.3340 |
0.3146 | 20.68 | 7000 | 0.6511 | 0.3273 |
0.3003 | 22.16 | 7500 | 0.6815 | 0.3253 |
0.2783 | 23.63 | 8000 | 0.6761 | 0.3215 |
0.2601 | 25.11 | 8500 | 0.6762 | 0.3187 |
0.2528 | 26.59 | 9000 | 0.6687 | 0.3194 |
0.2409 | 28.06 | 9500 | 0.7064 | 0.3163 |
0.2359 | 29.54 | 10000 | 0.6907 | 0.3161 |
### Framework versions
- Transformers 4.37.2
- Pytorch 2.0.1+cu117
- Datasets 2.20.0
- Tokenizers 0.15.2
## Citation
    @misc{rufai2025endtoendtrainingautomaticspeech,
      title={Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin},
      author={Amina Mardiyyah Rufai and Afolabi Abeeb and Esther Oduntan and Tayo Arulogun and Oluwabukola Adegboro and Daniel Ajisafe},
      year={2025},
      eprint={2010.11123},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2010.11123},
    }