---
license: apache-2.0
tags:
- generated_from_trainer
- automatic_speech_recognition
- asr
- nlp
- speech_to_text
- low_resource
metrics:
- wer
base_model: facebook/wav2vec2-large-xlsr-53
model-index:
- name: pidgin-wav2vec2-xlsr53
  results: []
datasets:
- asr-nigerian-pidgin/nigerian-pidgin-1.0
pipeline_tag: automatic-speech-recognition
---


# pidgin-wav2vec2-xlsr53

This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the [Nigerian Pidgin](https://huggingface.co/datasets/asr-nigerian-pidgin/nigerian-pidgin-1.0) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.6907
- WER: 0.3161 (validation)

## Model description

*to be updated*

## Intended uses & limitations

**Intended Uses**:
- Best suited for automatic speech recognition (ASR) tasks on Nigerian Pidgin audio, such as speech-to-text conversion and related downstream tasks. 
- Academic research on low-resource and creole language ASR.

**Known Limitations**:
- Performance may degrade with dialectal variation, heavy code-switching, or noisy audio environments. 
- Model reflects biases present in the training dataset, which may affect accuracy on underrepresented demographics, phonetic variations or topics. 
- May struggle with rare words, numerals, and domain-specific terminology not well represented in the training set. 
- Not recommended for high-stakes domains (e.g., legal, medical) without domain-specific retraining/finetuning. 


## Training and evaluation data

The model was fine-tuned on the [Nigerian Pidgin ASR v1.0 dataset](https://huggingface.co/datasets/asr-nigerian-pidgin/nigerian-pidgin-1.0), consisting of over 4,200 utterances recorded by 10 native speakers (balanced across gender and age) using the LIG-Aikuma mobile platform. Recordings were collected in controlled environments to ensure high-quality audio.

Performance: WER of 7.4% (train), 31.6% (validation), and 29.6% (test), outperforming baselines such as QuartzNet and zero-shot XLSR. These results demonstrate the effectiveness of targeted fine-tuning for low-resource ASR.
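
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that definition in plain Python. It is not the evaluation code used for this model, and the example sentences are hypothetical rather than drawn from the dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one row of the DP table.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


# Hypothetical example: one substituted word out of four reference words.
example_wer = word_error_rate("how you dey today", "how you de today")
print(example_wer)  # 0.25
```

Corpus-level WER is usually computed by summing errors and reference words over all utterances before dividing, so it can differ from the mean of per-utterance scores.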

## Training procedure
We fine-tuned the facebook/wav2vec2-large-xlsr-53 model using the Nigerian Pidgin ASR dataset, following the methodology outlined in the XLSR-53 paper. Training was performed on a single NVIDIA A100 GPU using the Hugging Face transformers library with fp16 mixed precision to accelerate computation and reduce memory usage.

A key modification from the standard setup was unfreezing the feature encoder during fine-tuning. This adjustment yielded improved performance, lowering word error rates (WER) on both validation and test sets compared to the frozen-encoder approach.
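
As a rough illustration of what "frozen" versus "unfrozen" means in practice, the sketch below uses the `freeze_feature_encoder()` method of `Wav2Vec2ForCTC` from the transformers library. It assumes that leaving the encoder unfrozen simply means not calling that method. The tiny configuration is hypothetical and chosen only so the snippet runs offline; it is not the XLSR-53 architecture, and this is not the authors' training script.

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

# Tiny, randomly initialised model so the sketch runs without downloading
# weights. These sizes are illustrative assumptions, NOT the XLSR-53 settings.
tiny_config = Wav2Vec2Config(
    vocab_size=32,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    conv_dim=(32, 32),
    conv_stride=(5, 2),
    conv_kernel=(10, 3),
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=2,
)


def count_trainable(module) -> int:
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Unfrozen setup: every parameter, including the convolutional
# feature encoder, stays trainable.
unfrozen = Wav2Vec2ForCTC(tiny_config)

# Frozen setup: the convolutional feature encoder is excluded from training.
frozen = Wav2Vec2ForCTC(tiny_config)
frozen.freeze_feature_encoder()

frozen_encoder_trainable = count_trainable(frozen.wav2vec2.feature_extractor)
unfrozen_encoder_trainable = count_trainable(unfrozen.wav2vec2.feature_extractor)
print(frozen_encoder_trainable, unfrozen_encoder_trainable)
```

With a real checkpoint the same pattern would apply after loading the model with `from_pretrained`.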

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-4
- train_batch_size: 4
- eval_batch_size: 4
- seed: 3407
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 30
- mixed_precision_training: Native AMP

This configuration balanced training stability, efficiency, and accuracy, allowing the model to adapt effectively to Nigerian Pidgin speech patterns despite the dataset’s limited size.
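
To make the interaction between these settings concrete, here is a small sketch of the effective batch size and of a linear schedule with warmup, written in plain Python. The total step count of 10,150 is an assumption estimated from the training log (step 10,000 falls at epoch 29.54 of 30) and is not a value reported by the authors. The function `linear_warmup_lr` is a hypothetical helper, not a library API.

```python
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 2
WARMUP_STEPS = 1000
TOTAL_STEPS = 10_150  # assumption: estimated from the log, not reported

# On a single GPU, the effective batch is per-device batch x accumulation steps.
effective_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS


def linear_warmup_lr(step: int) -> float:
    """Learning rate at an optimizer step: linear ramp up, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(effective_batch_size)  # 8
for step in (0, 500, 1000, 5000, TOTAL_STEPS):
    print(step, linear_warmup_lr(step))
```

Under this schedule the peak rate of 1e-4 is reached only at step 1000, which coincides with the point in the training log where WER is still 1.0 and begins to fall afterwards.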

### Performance comparison for frozen and unfrozen encoder

| Encoder State | Val WER | Test WER |
| ------------- | ------- | -------- |
| Frozen        | 0.332   |   0.436  |
| Unfrozen      | 0.3161  |   0.296  |
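
The gap between the two setups is easier to read as absolute and relative WER reductions. The short calculation below uses only the four values from the table; the helper name `relative_reduction` is introduced here for illustration.

```python
# WER values copied from the table above.
wer = {
    "frozen": {"val": 0.332, "test": 0.436},
    "unfrozen": {"val": 0.3161, "test": 0.296},
}


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


for split in ("val", "test"):
    absolute_points = (wer["frozen"][split] - wer["unfrozen"][split]) * 100
    relative_percent = relative_reduction(wer["frozen"][split], wer["unfrozen"][split]) * 100
    print(f"{split}: {absolute_points:.1f} points absolute, {relative_percent:.1f}% relative")
# val: 1.6 points absolute, 4.8% relative
# test: 14.0 points absolute, 32.1% relative
```

The improvement on the test set is much larger than on the validation set.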


### Training results (unfrozen model)

| Training Loss | Epoch | Step  | Validation Loss | WER    |
|:-------------:|:-----:|:-----:|:---------------:|:------:|
| 6.604         | 1.48  | 500   | 3.0540          | 1.0    |
| 3.0176        | 2.95  | 1000  | 3.0035          | 1.0    |
| 2.1071        | 4.43  | 1500  | 1.0811          | 0.6289 |
| 1.1143        | 5.91  | 2000  | 0.8348          | 0.5017 |
| 0.8501        | 7.39  | 2500  | 0.7707          | 0.4352 |
| 0.7272        | 8.86  | 3000  | 0.7410          | 0.4075 |
| 0.6038        | 10.34 | 3500  | 0.6283          | 0.3850 |
| 0.5334        | 11.82 | 4000  | 0.6356          | 0.3701 |
| 0.4645        | 13.29 | 4500  | 0.6243          | 0.3657 |
| 0.4251        | 14.77 | 5000  | 0.6838          | 0.3492 |
| 0.3801        | 16.25 | 5500  | 0.6619          | 0.3445 |
| 0.3636        | 17.73 | 6000  | 0.6945          | 0.3360 |
| 0.3366        | 19.2  | 6500  | 0.6108          | 0.3340 |
| 0.3146        | 20.68 | 7000  | 0.6511          | 0.3273 |
| 0.3003        | 22.16 | 7500  | 0.6815          | 0.3253 |
| 0.2783        | 23.63 | 8000  | 0.6761          | 0.3215 |
| 0.2601        | 25.11 | 8500  | 0.6762          | 0.3187 |
| 0.2528        | 26.59 | 9000  | 0.6687          | 0.3194 |
| 0.2409        | 28.06 | 9500  | 0.7064          | 0.3163 |
| 0.2359        | 29.54 | 10000 | 0.6907          | 0.3161 |
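
In this log, validation loss and WER do not reach their minimum at the same step. The sketch below finds both minima from the logged values (step, validation loss, WER). It only illustrates how the two criteria diverge in this table and makes no claim about how the released checkpoint was selected.

```python
# (step, validation loss, WER) copied from the table above.
training_log = [
    (500, 3.0540, 1.0), (1000, 3.0035, 1.0), (1500, 1.0811, 0.6289),
    (2000, 0.8348, 0.5017), (2500, 0.7707, 0.4352), (3000, 0.7410, 0.4075),
    (3500, 0.6283, 0.3850), (4000, 0.6356, 0.3701), (4500, 0.6243, 0.3657),
    (5000, 0.6838, 0.3492), (5500, 0.6619, 0.3445), (6000, 0.6945, 0.3360),
    (6500, 0.6108, 0.3340), (7000, 0.6511, 0.3273), (7500, 0.6815, 0.3253),
    (8000, 0.6761, 0.3215), (8500, 0.6762, 0.3187), (9000, 0.6687, 0.3194),
    (9500, 0.7064, 0.3163), (10000, 0.6907, 0.3161),
]

best_by_loss = min(training_log, key=lambda row: row[1])
best_by_wer = min(training_log, key=lambda row: row[2])

print("lowest validation loss:", best_by_loss)  # (6500, 0.6108, 0.334)
print("lowest WER:", best_by_wer)               # (10000, 0.6907, 0.3161)
```

Validation loss is lowest at step 6500, while WER continues to fall until the final logged step.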


### Framework versions

- Transformers 4.37.2
- PyTorch 2.0.1+cu117
- Datasets 2.20.0
- Tokenizers 0.15.2

## Citation
@misc{rufai2025endtoendtrainingautomaticspeech,
      title={Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin}, 
      author={Amina Mardiyyah Rufai and Afolabi Abeeb and Esther Oduntan and Tayo Arulogun and Oluwabukola Adegboro and Daniel Ajisafe},
      year={2025},
      eprint={2010.11123},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2010.11123}, 
}