	---
	license: apache-2.0
	tags:
	- generated_from_trainer
	- automatic_speech_recognition
	- asr
	- nlp
	- speech_to_text
	- low_resource
	metrics:
	- wer
	base_model: facebook/wav2vec2-large-xlsr-53
	model-index:
	- name: pidgin-wav2vec2-xlsr53
	  results: []
	datasets:
	- asr-nigerian-pidgin/nigerian-pidgin-1.0
	pipeline_tag: automatic-speech-recognition
	---


	# pidgin-wav2vec2-xlsr53

	This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the [Nigerian Pidgin](https://huggingface.co/datasets/asr-nigerian-pidgin/nigerian-pidgin-1.0) dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.6907
	- WER: 0.3161 (validation)

	## Model description

	to be updated

	## Intended uses & limitations

	Intended Uses:
	- Best suited for automatic speech recognition (ASR) tasks on Nigerian Pidgin audio, such as speech-to-text conversion and related downstream tasks.
	- Academic research on low-resource and creole language ASR.
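A minimal inference sketch for the speech-to-text use case, assuming the `pipeline` API from `transformers`. The repository id is a placeholder (this card does not state the full Hub path), and the helper name `transcribe` is hypothetical. XLSR-53 checkpoints expect 16 kHz audio; when given a file path, the pipeline decodes the file at the sampling rate the model's feature extractor expects, which requires ffmpeg.

```python
# Placeholder: replace "<namespace>" with the account that hosts this model.
MODEL_ID = "<namespace>/pidgin-wav2vec2-xlsr53"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file with the fine-tuned checkpoint."""
    # Imported lazily so that defining this helper does not require
    # transformers/torch to be installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(audio_path)
    return result["text"]
```

Calling `transcribe("sample.wav")` (a hypothetical file) downloads the checkpoint on first use and returns the predicted transcript as a string.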

	Known Limitations:
	- Performance may degrade with dialectal variation, heavy code-switching, or noisy audio environments.
	- The model reflects biases present in the training dataset, which may affect accuracy on underrepresented demographics, phonetic variations, or topics.
	- May struggle with rare words, numerals, and domain-specific terminology not well represented in the training set.
	- Not recommended for high-stakes domains (e.g., legal, medical) without domain-specific retraining or fine-tuning.


	## Training and evaluation data

	The model was fine-tuned on the [Nigerian Pidgin ASR v1.0 dataset](https://huggingface.co/datasets/asr-nigerian-pidgin/nigerian-pidgin-1.0), consisting of over 4,200 utterances recorded by 10 native speakers (balanced across gender and age) using the LIG-Aikuma mobile platform. Recordings were collected in controlled environments to ensure high-quality audio.
	Performance: WER of 7.4% (train), 31.6% (validation), and 29.6% (test), outperforming baselines such as QuartzNet and zero-shot XLSR. These results demonstrate the effectiveness of targeted fine-tuning for low-resource ASR.
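For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration for a single utterance; it is not the evaluation code used for this card, which does not say which tool computed its WER, and the example sentences are made up.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of one hypothesis against one reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of four reference words.
print(wer("how you dey today", "how you de today"))  # 0.25
```

Corpus-level scores such as the ones reported here are usually computed by summing errors and reference words over all utterances rather than averaging per-utterance values.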

	## Training procedure

	We fine-tuned the `facebook/wav2vec2-large-xlsr-53` model on the Nigerian Pidgin ASR dataset, following the methodology outlined in the XLSR-53 paper. Training was performed on a single NVIDIA A100 GPU using the Hugging Face `transformers` library with fp16 mixed precision to accelerate computation and reduce memory usage.

	A key modification from the standard setup was unfreezing the feature encoder during fine-tuning. This adjustment yielded improved performance, lowering word error rates (WER) on both validation and test sets compared to the frozen-encoder approach.
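In `transformers`, `Wav2Vec2ForCTC` provides `freeze_feature_encoder()`, and leaving the encoder unfrozen simply means not calling it. The helper below is an illustrative sketch, not the training script used for this card (which is not published here); the function name is hypothetical. It applies either setting and reports how many parameters remain trainable.

```python
def configure_feature_encoder(model, freeze: bool) -> int:
    """Optionally freeze the convolutional feature encoder.

    Returns the number of parameters that will receive gradients.
    """
    if freeze:
        # Standard setup: the CNN feature encoder is kept fixed.
        model.freeze_feature_encoder()
    # With freeze=False nothing is changed: all parameters are
    # trainable by default, which is the "unfrozen" setting.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Example (requires transformers and torch; vocabulary setup omitted):
#   from transformers import Wav2Vec2ForCTC
#   model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53")
#   trainable = configure_feature_encoder(model, freeze=False)
```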

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 1e-4
	- train_batch_size: 4
	- eval_batch_size: 4
	- seed: 3407
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 8
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 1000
	- num_epochs: 30
	- mixed_precision_training: Native AMP
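Two of the values above follow from simple arithmetic, sketched here in plain Python. The effective batch size is the per-device batch size times the gradient accumulation steps times the number of GPUs. The learning-rate function mirrors the shape of a linear schedule with warmup; the total number of optimizer steps is not stated in this card, so `total_steps=10_000` below is an assumption based on the last logged step, and `linear_warmup_lr` is a hypothetical helper name.

```python
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 2
NUM_GPUS = 1  # single A100
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS  # 8


def linear_warmup_lr(step: int, *, total_steps: int,
                     base_lr: float = 1e-4,
                     warmup_steps: int = 1000) -> float:
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to base_lr over warmup_steps, then decays
    linearly to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps is an assumed value, see the note above.
for step in (0, 500, 1000, 5500, 10_000):
    print(step, linear_warmup_lr(step, total_steps=10_000))
```

Under that assumption the rate peaks at 1e-4 at step 1000 and is back to half of that by step 5500.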

	This configuration balanced training stability, efficiency, and accuracy, allowing the model to adapt effectively to Nigerian Pidgin speech patterns despite the dataset’s limited size.

	### Performance comparison: frozen vs. unfrozen encoder

	| Encoder state | Validation WER | Test WER |
	| ------------- | -------------- | -------- |
	| Frozen        | 0.332          | 0.436    |
	| Unfrozen      | 0.3161         | 0.296    |
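To put the two rows in perspective, the gap can be expressed as absolute and relative WER reductions. The short calculation below uses only the four numbers from the table; the helper name is hypothetical.

```python
def relative_wer_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline WER removed by the improved system."""
    return (baseline - improved) / baseline


frozen = {"val": 0.332, "test": 0.436}
unfrozen = {"val": 0.3161, "test": 0.296}

for split in ("val", "test"):
    absolute = frozen[split] - unfrozen[split]
    relative = relative_wer_reduction(frozen[split], unfrozen[split])
    print(f"{split}: absolute {absolute:.4f}, relative {relative:.1%}")

# val: absolute 0.0159, relative 4.8%
# test: absolute 0.1400, relative 32.1%
```

The validation gain is modest, while the test-set gain is roughly a third of the frozen-encoder error.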


	### Training results (unfrozen model)

	| Training Loss | Epoch | Step  | Validation Loss | WER    |
	|:-------------:|:-----:|:-----:|:---------------:|:------:|
	| 6.604         | 1.48  | 500   | 3.0540          | 1.0    |
	| 3.0176        | 2.95  | 1000  | 3.0035          | 1.0    |
	| 2.1071        | 4.43  | 1500  | 1.0811          | 0.6289 |
	| 1.1143        | 5.91  | 2000  | 0.8348          | 0.5017 |
	| 0.8501        | 7.39  | 2500  | 0.7707          | 0.4352 |
	| 0.7272        | 8.86  | 3000  | 0.7410          | 0.4075 |
	| 0.6038        | 10.34 | 3500  | 0.6283          | 0.3850 |
	| 0.5334        | 11.82 | 4000  | 0.6356          | 0.3701 |
	| 0.4645        | 13.29 | 4500  | 0.6243          | 0.3657 |
	| 0.4251        | 14.77 | 5000  | 0.6838          | 0.3492 |
	| 0.3801        | 16.25 | 5500  | 0.6619          | 0.3445 |
	| 0.3636        | 17.73 | 6000  | 0.6945          | 0.3360 |
	| 0.3366        | 19.2  | 6500  | 0.6108          | 0.3340 |
	| 0.3146        | 20.68 | 7000  | 0.6511          | 0.3273 |
	| 0.3003        | 22.16 | 7500  | 0.6815          | 0.3253 |
	| 0.2783        | 23.63 | 8000  | 0.6761          | 0.3215 |
	| 0.2601        | 25.11 | 8500  | 0.6762          | 0.3187 |
	| 0.2528        | 26.59 | 9000  | 0.6687          | 0.3194 |
	| 0.2409        | 28.06 | 9500  | 0.7064          | 0.3163 |
	| 0.2359        | 29.54 | 10000 | 0.6907          | 0.3161 |


	### Framework versions

	- Transformers 4.37.2
	- PyTorch 2.0.1+cu117
	- Datasets 2.20.0
	- Tokenizers 0.15.2

	## Citation
	@misc{rufai2025endtoendtrainingautomaticspeech,
	title={Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin},
	author={Amina Mardiyyah Rufai and Afolabi Abeeb and Esther Oduntan and Tayo Arulogun and Oluwabukola Adegboro and Daniel Ajisafe},
	year={2025},
	eprint={2010.11123},
	archivePrefix={arXiv},
	primaryClass={eess.AS},
	url={https://arxiv.org/abs/2010.11123},
	}