---
license: apache-2.0
language:
- nl
base_model: openai/whisper-small
tags:
- automatic-speech-recognition
- whisper
- dutch
- speech
- audio
- asr
- hf-asr-leaderboard
datasets:
- mozilla-foundation/common_voice_17_0
model-index:
- name: whisper-small-cv-only-nl
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 17.0 (Dutch)
      type: mozilla-foundation/common_voice_17_0
      config: nl
      split: test
    metrics:
    - type: wer
      value: 11.13
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Multilingual LibriSpeech (Dutch)
      type: facebook/multilingual_librispeech
      config: dutch
      split: test
    metrics:
    - type: wer
      value: 30.71
      name: Test WER (MLS)
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
# Whisper-Small Dutch - Common Voice Only (Baseline)
This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) for Dutch automatic speech recognition (ASR). It was trained exclusively on **Common Voice 17.0 Dutch** without any synthetic data augmentation, serving as a baseline for comparison with synthetic-augmented models.
## Introduction
### Purpose
This model serves as the **baseline** for evaluating the effectiveness of synthetic data augmentation in Dutch ASR. By training only on real speech data from Common Voice 17.0, we establish reference performance metrics against which synthetic-augmented models can be compared.
### How the Model Was Created
The model was fine-tuned from `openai/whisper-small` using the Hugging Face Transformers library:
1. **Training Data**: 34,952 real speech samples from Common Voice 17.0 Dutch (train split).
2. **Optimization**: Trained for 5 epochs with a learning rate of 1e-5, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.
3. **Checkpoint Selection**: The best checkpoint was selected by validation loss; it occurred at step 400 (val_loss = 0.1492).
This baseline achieves **11.13% WER** on the Common Voice test set, which synthetic-augmented models improve upon by up to 2.4% relative.
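The exact training script is not published in this repository; as a reference, here is a minimal sketch of the starting point, following the standard Hugging Face Whisper fine-tuning recipe (the `language`/`task` settings and the config resets are assumptions based on that recipe, not confirmed details of this run):
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Start from the multilingual base checkpoint
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="dutch", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Let the model learn the language/task tokens from the labels
# instead of forcing them at generation time during training
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
```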
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | openai/whisper-small |
| **Language** | Dutch (nl) |
| **Task** | Automatic Speech Recognition (transcribe) |
| **Parameters** | 244M |
| **Training Data** | Common Voice 17.0 Dutch only |
| **Total Training Samples** | 34,952 |
| **Sampling Rate** | 16 kHz |
## Evaluation Results
### This Model (whisper-small-cv-only-nl)
| Metric | Value |
|--------|-------|
| **Validation Loss** | 0.1491 |
| **Validation WER** | 8.73% |
| **Test WER (Common Voice)** | 11.13% |
| **Test WER (MLS)** | 30.71% |
| **Best Checkpoint** | Step 400 |
| **Max Training Steps** | 680 |
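WER figures of this kind are typically computed with the `evaluate` library; a minimal sketch (the exact text normalization behind the reported numbers is not documented here, so treat the result as an approximation):
```python
import evaluate

wer_metric = evaluate.load("wer")

# predictions: model transcripts; references: ground-truth sentences
predictions = ["dit is een voorbeeld", "nog een zin"]
references = ["dit is een voorbeeld", "nog één zin"]

# compute() returns a fraction; multiply by 100 for percent
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}%")
```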
### Comparison with Synthetic-Augmented Models (Whisper-Small Dutch)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---------------|-----------|----------|---------|---------------|----------------|
| **Common Voice Only** | **680** | **0.1491** | **8.73%** | **11.13%** | **30.71%** |
| High-Quality Filtered + CV | 890 | 0.1493 | 8.76% | 11.00% | 29.91% |
| Mid-High Quality Filtered + CV | 1,270 | 0.1484 | 8.73% | 10.86% | 30.04% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.1484 | 8.64% | 10.91% | 30.06% |
### Key Observations
- **Baseline performance**: 11.13% Test WER on Common Voice, 30.71% on MLS
- **Fastest training**: Only 680 max steps (smallest dataset)
- **Room for improvement**: Synthetic augmentation reduces Test WER by up to 0.27 percentage points (2.4% relative)
- **Cross-domain gap**: The 19.58 percentage-point spread between CV and MLS test WER highlights the domain mismatch
## Training Data
### Dataset
| Source | Samples | Description |
|--------|---------|-------------|
| [Common Voice 17.0 Dutch](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 34,952 | Real speech from Mozilla's crowdsourced dataset |
Common Voice 17.0 Dutch contains crowdsourced voice recordings from volunteer contributors reading text prompts. The dataset provides diverse speaker demographics but is limited in acoustic conditions and speaking styles.
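A minimal sketch of loading the same split with the `datasets` library (Common Voice is gated on the Hub, so prior acceptance of its terms and an authenticated token are assumed):
```python
from datasets import Audio, load_dataset

cv_nl = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="train")

# Whisper expects 16 kHz input; resample lazily on access
cv_nl = cv_nl.cast_column("audio", Audio(sampling_rate=16_000))
print(cv_nl[0]["sentence"])
```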
## Training Procedure
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Learning Rate | 1e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
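A sketch of how these settings could map onto `Seq2SeqTrainingArguments` (illustrative, not the authors' script; in particular, the per-device/accumulation split of the global batch size of 256 is an assumption):
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-cv-only-nl",
    learning_rate=1e-5,
    warmup_steps=200,
    num_train_epochs=5,
    per_device_train_batch_size=64,  # 64 x 4 accumulation = global 256 (assumed split)
    gradient_accumulation_steps=4,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",  # "evaluation_strategy" in older transformers releases
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)
```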
### Training Infrastructure
- **GPU**: NVIDIA H200 (141 GB VRAM)
- **Operating System**: Ubuntu 22.04
- **Framework**: Hugging Face Transformers
### Training Curve
```
Step 100: val_loss = 0.1754
Step 200: val_loss = 0.1563
Step 300: val_loss = 0.1514
Step 400: val_loss = 0.1492 ← Best checkpoint
Step 500: val_loss = 0.1516
Step 650: val_loss = 0.1533
```
## Usage
### Transcription Pipeline
```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an ASR pipeline (GPU assumed; use device="cpu" otherwise)
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-cv-only-nl",
    device="cuda"
)
result = transcriber("path/to/dutch_audio.wav")
print(result["text"])
```
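For recordings longer than Whisper's 30-second window, the same pipeline can transcribe in chunks:
```python
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-cv-only-nl",
    chunk_length_s=30,  # split long audio into 30 s windows
    batch_size=8,
    device="cuda"
)
result = transcriber("path/to/long_dutch_audio.wav", return_timestamps=True)
print(result["text"])
```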
### Direct Model Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-small-cv-only-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-small-cv-only-nl")
model.to("cuda")

# Whisper expects 16 kHz mono audio
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

# Generate token ids and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
### Specifying Language
To force Dutch transcription and skip Whisper's automatic language detection, set the generation config once before calling `generate`:
```python
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"

# Alternatively, pass the options per call:
# predicted_ids = model.generate(input_features, language="nl", task="transcribe")
```
## When to Use This Model
This model is ideal for:
- **Baseline comparisons**: Evaluating the impact of synthetic data augmentation
- **Real-data-only requirements**: When synthetic data usage is not permitted
- **Minimal training**: Fastest training time among all configurations
For better performance, consider the synthetic-augmented variants:
- [whisper-small-high-mixed-nl](https://huggingface.co/yuriyvnv/whisper-small-high-mixed-nl): 0.13 percentage points lower Test WER on Common Voice, best MLS performance
- [whisper-small-mixed-cv-nl](https://huggingface.co/yuriyvnv/whisper-small-mixed-cv-nl): 0.27 percentage points lower Test WER on Common Voice, best CV performance
## Limitations
- **No synthetic augmentation**: Does not benefit from additional acoustic diversity
- **Domain specificity**: Trained only on Common Voice; limited generalization to other domains
- **Cross-domain performance**: Significant drop on the MLS benchmark (30.71% WER vs. 11.13% on Common Voice)
- **Dialect coverage**: Performance may vary across Dutch regional variants
## Citation
```bibtex
@article{perezhohin2024enhancing,
title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
journal={IEEE Access},
year={2024},
publisher={IEEE}
}
```
## References
- **Base Model**: [openai/whisper-small](https://huggingface.co/openai/whisper-small)
- **Training Data**: [mozilla-foundation/common_voice_17_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- **Whisper Paper**: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- **IEEE Access Paper**: [Enhancing ASR with Semantic Audio Filtering](https://ieeexplore.ieee.org/document/10720758)
## License
Apache 2.0