pidgin-wav2vec2-xlsr53

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the Nigerian Pidgin dataset. It achieves the following results on the evaluation set:

  • Loss: 0.6907
  • Wer: 0.3161 (val)

Model description

to be updated

Intended uses & limitations

Intended Uses:

  • Best suited for automatic speech recognition (ASR) tasks on Nigerian Pidgin audio, such as speech-to-text conversion and related downstream tasks (see the inference sketch at the end of this section).
  • Academic research on low-resource and creole language ASR.

Known Limitations:

  • Performance may degrade with dialectal variation, heavy code-switching, or noisy audio environments.
  • The model reflects biases present in the training dataset, which may affect accuracy for underrepresented demographics, phonetic variations, or topics.
  • May struggle with rare words, numerals, and domain-specific terminology not well represented in the training set.
  • Not recommended for high-stakes domains (e.g., legal, medical) without domain-specific retraining/finetuning.
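
To make the intended ASR use concrete, below is a minimal inference sketch using the Hugging Face pipeline API. The audio file name is a placeholder; wav2vec2 models expect 16 kHz audio, and the pipeline decodes and resamples file input via ffmpeg.

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an automatic-speech-recognition pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="asr-nigerian-pidgin/pidgin-wav2vec2-xlsr53",
)

# "sample.wav" is a placeholder path. Wav2vec2 expects 16 kHz mono audio;
# when given a file path, the pipeline decodes and resamples it via ffmpeg.
result = asr("sample.wav")
print(result["text"])
```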

Training and evaluation data

The model was fine-tuned on the Nigerian Pidgin ASR v1.0 dataset, consisting of over 4,200 utterances recorded by 10 native speakers (balanced across gender and age) using the LIG-Aikuma mobile platform. Recordings were collected in controlled environments to ensure high-quality audio. Performance: WER 7.4% (train), 31.6% (validation), and 29.6% (test), exceeding baseline benchmarks such as QuartzNet and zero-shot XLSR. These results demonstrate the effectiveness of targeted fine-tuning for low-resource ASR.
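
For reference, word error rate (WER) can be computed with the jiwer library, as in the sketch below; the transcripts shown are illustrative placeholders, not samples from the dataset.

```python
from jiwer import wer

# Illustrative placeholder transcripts; actual evaluation uses the dataset's
# reference texts and the model's decoded predictions.
references = ["how you dey", "wetin dey happen"]
predictions = ["how you dey", "wetin de happen"]

print(f"WER: {wer(references, predictions):.3f}")
```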

Training procedure

We fine-tuned the facebook/wav2vec2-large-xlsr-53 model using the Nigerian Pidgin ASR dataset, following the methodology outlined in the XLSR-53 paper. Training was performed on a single NVIDIA A100 GPU using the Hugging Face transformers library with fp16 mixed precision to accelerate computation and reduce memory usage.

A key modification from the standard setup was unfreezing the feature encoder during fine-tuning. This adjustment yielded improved performance, lowering word error rates (WER) on both validation and test sets compared to the frozen-encoder approach.
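
A minimal sketch of the frozen vs. unfrozen choice, assuming the standard Wav2Vec2ForCTC API in transformers; the vocabulary size and pad token id below are placeholders that depend on the character vocabulary built from the Pidgin transcripts.

```python
from transformers import Wav2Vec2ForCTC

# Load the multilingual XLSR-53 checkpoint with a freshly initialized CTC head.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=0,   # placeholder: use the tokenizer's actual pad token id
    vocab_size=32,    # placeholder: size of the Pidgin character vocabulary
)

# Standard recipe: keep the convolutional feature encoder frozen.
# model.freeze_feature_encoder()

# This model instead leaves the feature encoder trainable, i.e. the call
# above is simply omitted so its parameters also receive gradient updates.
```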

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-4
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 3407
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1000
  • num_epochs: 30
  • mixed_precision_training: Native AMP

This configuration balanced training stability, efficiency, and accuracy, allowing the model to adapt effectively to Nigerian Pidgin speech patterns despite the dataset’s limited size. A sketch of the corresponding configuration appears below.
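
The following is a sketch of TrainingArguments matching the hyperparameters listed above, assuming the Hugging Face Trainer API; the output directory is a placeholder, and the 500-step evaluation/saving cadence follows the step intervals in the results table below.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="pidgin-wav2vec2-xlsr53",  # placeholder output directory
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,        # effective train batch size of 8
    seed=3407,
    num_train_epochs=30,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    fp16=True,                            # native AMP mixed precision
    # optimizer betas=(0.9, 0.999) and epsilon=1e-8 are the library defaults
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_steps=500,
)
```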

Performance comparison for the frozen and unfrozen feature encoder:

| Encoder State | Val WER | Test WER |
|:---|:---|:---|
| Frozen | 0.332 | 0.436 |
| Unfrozen | 0.3161 | 0.296 |

Training results (unfrozen model)

| Training Loss | Epoch | Step | Validation Loss | WER |
|:---|:---|:---|:---|:---|
| 6.604 | 1.48 | 500 | 3.0540 | 1.0 |
| 3.0176 | 2.95 | 1000 | 3.0035 | 1.0 |
| 2.1071 | 4.43 | 1500 | 1.0811 | 0.6289 |
| 1.1143 | 5.91 | 2000 | 0.8348 | 0.5017 |
| 0.8501 | 7.39 | 2500 | 0.7707 | 0.4352 |
| 0.7272 | 8.86 | 3000 | 0.7410 | 0.4075 |
| 0.6038 | 10.34 | 3500 | 0.6283 | 0.3850 |
| 0.5334 | 11.82 | 4000 | 0.6356 | 0.3701 |
| 0.4645 | 13.29 | 4500 | 0.6243 | 0.3657 |
| 0.4251 | 14.77 | 5000 | 0.6838 | 0.3492 |
| 0.3801 | 16.25 | 5500 | 0.6619 | 0.3445 |
| 0.3636 | 17.73 | 6000 | 0.6945 | 0.3360 |
| 0.3366 | 19.2 | 6500 | 0.6108 | 0.3340 |
| 0.3146 | 20.68 | 7000 | 0.6511 | 0.3273 |
| 0.3003 | 22.16 | 7500 | 0.6815 | 0.3253 |
| 0.2783 | 23.63 | 8000 | 0.6761 | 0.3215 |
| 0.2601 | 25.11 | 8500 | 0.6762 | 0.3187 |
| 0.2528 | 26.59 | 9000 | 0.6687 | 0.3194 |
| 0.2409 | 28.06 | 9500 | 0.7064 | 0.3163 |
| 0.2359 | 29.54 | 10000 | 0.6907 | 0.3161 |

Framework versions

  • Transformers 4.37.2
  • PyTorch 2.0.1+cu117
  • Datasets 2.20.0
  • Tokenizers 0.15.2

Citation

@misc{rufai2025endtoendtrainingautomaticspeech,
  title={Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin},
  author={Amina Mardiyyah Rufai and Afolabi Abeeb and Esther Oduntan and Tayo Arulogun and Oluwabukola Adegboro and Daniel Ajisafe},
  year={2025},
  eprint={2010.11123},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2010.11123},
}
