pidgin-wav2vec2-xlsr53

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the Nigerian Pidgin dataset. It achieves the following results on the evaluation set:

  • Loss: 0.6907
  • Wer: 0.3161 (val)

Model description

to be updated

Intended uses & limitations

Intended Uses:

  • Best suited for automatic speech recognition (ASR) tasks on Nigerian Pidgin audio, such as speech-to-text conversion and related downstream tasks (see the inference sketch at the end of this section).
  • Academic research on low-resource and creole language ASR.

Known Limitations:

  • Performance may degrade with dialectal variation, heavy code-switching, or noisy audio environments.
  • The model reflects biases present in the training dataset, which may affect accuracy for underrepresented demographics, phonetic variations, or topics.
  • May struggle with rare words, numerals, and domain-specific terminology not well represented in the training set.
  • Not recommended for high-stakes domains (e.g., legal, medical) without domain-specific retraining/finetuning.
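
To make the intended ASR use concrete, below is a minimal inference sketch using the Hugging Face pipeline API. The audio file name is a placeholder; wav2vec2 models expect 16 kHz audio, and the pipeline decodes and resamples file input via ffmpeg.

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an automatic-speech-recognition pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="asr-nigerian-pidgin/pidgin-wav2vec2-xlsr53",
)

# "sample.wav" is a placeholder path. Wav2vec2 expects 16 kHz mono audio;
# when given a file path, the pipeline decodes and resamples it via ffmpeg.
result = asr("sample.wav")
print(result["text"])
```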

Training and evaluation data

The model was fine-tuned on the Nigerian Pidgin ASR v1.0 dataset, consisting of over 4,200 utterances recorded by 10 native speakers (balanced across gender and age) using the LIG-Aikuma mobile platform. Recordings were collected in controlled environments to ensure high-quality audio. Performance: WER 7.4% (train), 31.6% (validation), and 29.6% (test), exceeding baseline benchmarks such as QuartzNet and zero-shot XLSR. These results demonstrate the effectiveness of targeted fine-tuning for low-resource ASR.
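
For reference, word error rate (WER) can be computed with the jiwer library, as in the sketch below; the transcripts shown are illustrative placeholders, not samples from the dataset.

```python
from jiwer import wer

# Illustrative placeholder transcripts; actual evaluation uses the dataset's
# reference texts and the model's decoded predictions.
references = ["how you dey", "wetin dey happen"]
predictions = ["how you dey", "wetin de happen"]

print(f"WER: {wer(references, predictions):.3f}")
```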

Training procedure

We fine-tuned the facebook/wav2vec2-large-xlsr-53 model using the Nigerian Pidgin ASR dataset, following the methodology outlined in the XLSR-53 paper. Training was performed on a single NVIDIA A100 GPU using the Hugging Face transformers library with fp16 mixed precision to accelerate computation and reduce memory usage.

A key modification from the standard setup was unfreezing the feature encoder during fine-tuning. This adjustment yielded improved performance, lowering word error rates (WER) on both validation and test sets compared to the frozen-encoder approach.
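
A minimal sketch of the frozen vs. unfrozen choice, assuming the standard Wav2Vec2ForCTC API in transformers; the vocabulary size and pad token id below are placeholders that depend on the character vocabulary built from the Pidgin transcripts.

```python
from transformers import Wav2Vec2ForCTC

# Load the multilingual XLSR-53 checkpoint with a freshly initialized CTC head.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=0,   # placeholder: use the tokenizer's actual pad token id
    vocab_size=32,    # placeholder: size of the Pidgin character vocabulary
)

# Standard recipe: keep the convolutional feature encoder frozen.
# model.freeze_feature_encoder()

# This model instead leaves the feature encoder trainable, i.e. the call
# above is simply omitted so its parameters also receive gradient updates.
```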

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-4
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 3407
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1000
  • num_epochs: 30
  • mixed_precision_training: Native AMP

This configuration balanced training stability, efficiency, and accuracy, allowing the model to adapt effectively to Nigerian Pidgin speech patterns despite the dataset’s limited size. A sketch of the corresponding configuration appears below.
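
The following is a sketch of TrainingArguments matching the hyperparameters listed above, assuming the Hugging Face Trainer API; the output directory is a placeholder, and the 500-step evaluation/saving cadence follows the step intervals in the results table below.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="pidgin-wav2vec2-xlsr53",  # placeholder output directory
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,        # effective train batch size of 8
    seed=3407,
    num_train_epochs=30,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    fp16=True,                            # native AMP mixed precision
    # optimizer betas=(0.9, 0.999) and epsilon=1e-8 are the library defaults
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_steps=500,
)
```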

Performance comparison for the frozen and unfrozen feature encoder:

| Encoder State | Val WER | Test WER |
|:---|:---|:---|
| Frozen | 0.332 | 0.436 |
| Unfrozen | 0.3161 | 0.296 |

Training results (unfrozen model)

| Training Loss | Epoch | Step | Validation Loss | WER |
|:---|:---|:---|:---|:---|
| 6.604 | 1.48 | 500 | 3.0540 | 1.0 |
| 3.0176 | 2.95 | 1000 | 3.0035 | 1.0 |
| 2.1071 | 4.43 | 1500 | 1.0811 | 0.6289 |
| 1.1143 | 5.91 | 2000 | 0.8348 | 0.5017 |
| 0.8501 | 7.39 | 2500 | 0.7707 | 0.4352 |
| 0.7272 | 8.86 | 3000 | 0.7410 | 0.4075 |
| 0.6038 | 10.34 | 3500 | 0.6283 | 0.3850 |
| 0.5334 | 11.82 | 4000 | 0.6356 | 0.3701 |
| 0.4645 | 13.29 | 4500 | 0.6243 | 0.3657 |
| 0.4251 | 14.77 | 5000 | 0.6838 | 0.3492 |
| 0.3801 | 16.25 | 5500 | 0.6619 | 0.3445 |
| 0.3636 | 17.73 | 6000 | 0.6945 | 0.3360 |
| 0.3366 | 19.2 | 6500 | 0.6108 | 0.3340 |
| 0.3146 | 20.68 | 7000 | 0.6511 | 0.3273 |
| 0.3003 | 22.16 | 7500 | 0.6815 | 0.3253 |
| 0.2783 | 23.63 | 8000 | 0.6761 | 0.3215 |
| 0.2601 | 25.11 | 8500 | 0.6762 | 0.3187 |
| 0.2528 | 26.59 | 9000 | 0.6687 | 0.3194 |
| 0.2409 | 28.06 | 9500 | 0.7064 | 0.3163 |
| 0.2359 | 29.54 | 10000 | 0.6907 | 0.3161 |

Framework versions

  • Transformers 4.37.2
  • PyTorch 2.0.1+cu117
  • Datasets 2.20.0
  • Tokenizers 0.15.2

Citation

@misc{rufai2025endtoendtrainingautomaticspeech,
  title={Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin},
  author={Amina Mardiyyah Rufai and Afolabi Abeeb and Esther Oduntan and Tayo Arulogun and Oluwabukola Adegboro and Daniel Ajisafe},
  year={2025},
  eprint={2010.11123},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2010.11123},
}
