Whisper-Small Dutch - Common Voice Only (Baseline)
This model is a fine-tuned version of openai/whisper-small for Dutch automatic speech recognition (ASR). It was trained exclusively on Common Voice 17.0 Dutch without any synthetic data augmentation, serving as a baseline for comparison with synthetic-augmented models.
Introduction
Purpose
This model serves as the baseline for evaluating the effectiveness of synthetic data augmentation in Dutch ASR. By training only on real speech data from Common Voice 17.0, we establish reference performance metrics against which synthetic-augmented models can be compared.
How the Model Was Created
The model was fine-tuned from openai/whisper-small using the Hugging Face Transformers library:
- Training Data: 34,952 real speech samples from Common Voice 17.0 Dutch (train split).
- Optimization: Trained for 5 epochs with a learning rate of 1e-5, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.
- Checkpoint Selection: The best checkpoint was selected based on validation loss, occurring at step 400 with a validation loss of 0.1492.
This baseline achieves 11.13% WER on the Common Voice test set. The best synthetic-augmented model lowers this to 10.86%, a reduction of 0.27 percentage points (2.4% relative).
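
The sample count, batch size, and epoch count above can be related to the 680 max training steps reported in this card. The following is a minimal sketch of that arithmetic. It assumes the final partial batch of each epoch does not produce an optimizer step, which is an inference from the numbers and not something the card states. The function name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_samples, global_batch_size, epochs, drop_partial=True):
    """Estimate the total number of optimizer steps for a training run.

    drop_partial=True assumes the last incomplete batch of each epoch is
    skipped (an assumption, not confirmed by the model card).
    """
    per_epoch = num_samples / global_batch_size
    per_epoch = math.floor(per_epoch) if drop_partial else math.ceil(per_epoch)
    return per_epoch * epochs


# Figures taken from the model card: 34,952 samples, batch size 256, 5 epochs.
steps_floor = optimizer_steps(34_952, 256, 5)
steps_ceil = optimizer_steps(34_952, 256, 5, drop_partial=False)

print(steps_floor)  # 136 full batches per epoch * 5 epochs
print(steps_ceil)   # 137 batches per epoch * 5 epochs
```

Under the drop-partial assumption the estimate matches the reported 680 steps; counting the partial batch would give 685 instead.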
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-small |
| Language | Dutch (nl) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 244M |
| Training Data | Common Voice 17.0 Dutch only |
| Total Training Samples | 34,952 |
| Sampling Rate | 16 kHz |
Evaluation Results
This Model (whisper-small-cv-only-nl)
| Metric | Value |
|---|---|
| Validation Loss | 0.1491 |
| Validation WER | 8.73% |
| Test WER (Common Voice) | 11.13% |
| Test WER (MLS) | 30.71% |
| Best Checkpoint | Step 400 |
| Max Training Steps | 680 |
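
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition in plain Python. The card does not say which WER implementation or text normalization was used for the reported numbers, so the function `word_error_rate` and the example sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("zit" -> "zat") and one deletion ("de").
example_wer = word_error_rate("de kat zit op de mat", "de kat zat op mat")
print(f"{example_wer:.2%}")
```

Scores from this sketch are not directly comparable to the table unless the same casing and punctuation normalization is applied to both strings.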
Comparison with Synthetic-Augmented Models (Whisper-Small Dutch)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 680 | 0.1491 | 8.73% | 11.13% | 30.71% |
| High-Quality Filtered + CV | 890 | 0.1493 | 8.76% | 11.00% | 29.91% |
| Mid-High Quality Filtered + CV | 1,270 | 0.1484 | 8.73% | 10.86% | 30.04% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.1484 | 8.64% | 10.91% | 30.06% |
Key Observations
- Baseline performance: 11.13% Test WER on Common Voice, 30.71% on MLS
- Fastest training: Only 680 max steps (smallest dataset)
- Room for improvement: Synthetic augmentation reduces Common Voice Test WER by up to 0.27 percentage points (11.13% to 10.86%, 2.4% relative)
- Cross-domain gap: The 19.58 percentage-point difference between Common Voice (11.13%) and MLS (30.71%) Test WER highlights domain mismatch
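
The absolute and relative figures in the observations above follow directly from the comparison table. The short sketch below recomputes them from the reported Test WER values; the variable names are illustrative.

```python
# Test WER values (Common Voice, MLS) copied from the comparison table.
results = {
    "Common Voice Only": (11.13, 30.71),
    "High-Quality Filtered + CV": (11.00, 29.91),
    "Mid-High Quality Filtered + CV": (10.86, 30.04),
    "All Synthetic + CV (Unfiltered)": (10.91, 30.06),
}

baseline_cv, baseline_mls = results["Common Voice Only"]

# Absolute (percentage-point) and relative (%) reduction on Common Voice.
improvements = {}
for name, (cv_wer, _mls_wer) in results.items():
    absolute = baseline_cv - cv_wer
    relative = absolute / baseline_cv * 100
    improvements[name] = (round(absolute, 2), round(relative, 1))

best_name = max(improvements, key=lambda n: improvements[n][0])
cross_domain_gap = round(baseline_mls - baseline_cv, 2)

for name, (absolute, relative) in improvements.items():
    print(f"{name}: -{absolute} points ({relative}% relative)")
print("Best on Common Voice:", best_name)
print("Baseline CV-to-MLS gap:", cross_domain_gap, "points")
```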
Training Data
Dataset
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Dutch | 34,952 | Real speech from Mozilla's crowdsourced dataset |
Common Voice 17.0 Dutch contains crowdsourced voice recordings from volunteer contributors reading text prompts. The dataset provides diverse speaker demographics but is limited in acoustic conditions and speaking styles.
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
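
As a rough guide, the table above could map onto keyword arguments for `Seq2SeqTrainingArguments` in Hugging Face Transformers as sketched below. This is not the authors' training script. The split of the global batch size into a per-device batch of 64 with gradient accumulation, the output directory, and the save settings are assumptions added for illustration.

```python
GLOBAL_BATCH_SIZE = 256
PER_DEVICE_BATCH_SIZE = 64  # assumption: the card reports only the global size
NUM_DEVICES = 1             # the card mentions a single NVIDIA H200 GPU
GRAD_ACCUM_STEPS = GLOBAL_BATCH_SIZE // (PER_DEVICE_BATCH_SIZE * NUM_DEVICES)

training_kwargs = {
    "output_dir": "./whisper-small-cv-only-nl",  # hypothetical path
    "learning_rate": 1e-5,
    "per_device_train_batch_size": PER_DEVICE_BATCH_SIZE,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "warmup_steps": 200,
    "num_train_epochs": 5,
    "bf16": True,
    "optim": "adamw_torch_fused",
    # Named `evaluation_strategy` in older Transformers releases.
    "eval_strategy": "steps",
    "eval_steps": 50,
    # Assumption: checkpoints are saved on the evaluation schedule so the
    # best one can be reloaded at the end of training.
    "save_strategy": "steps",
    "save_steps": 50,
    "load_best_model_at_end": True,
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,  # lower validation loss is better
}

effective_batch = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
    * NUM_DEVICES
)
print(effective_batch)

# With Transformers installed, the dict can be unpacked as:
# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(**training_kwargs)
```

Any other split with the same product (for example a per-device batch of 32 with 8 accumulation steps) gives the same global batch size.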
Training Infrastructure
- GPU: NVIDIA H200 (141 GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Training Curve
Step 100: val_loss = 0.1754
Step 200: val_loss = 0.1563
Step 300: val_loss = 0.1514
Step 400: val_loss = 0.1492 ← Best checkpoint
Step 500: val_loss = 0.1516
Step 650: val_loss = 0.1533
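
Selecting the best checkpoint from the curve above amounts to taking the step with the lowest validation loss. A minimal sketch using only the logged values shown here (the full evaluation log at every 50 steps is not reproduced in this card):

```python
# Validation losses copied from the training curve above.
val_loss_by_step = {
    100: 0.1754,
    200: 0.1563,
    300: 0.1514,
    400: 0.1492,
    500: 0.1516,
    650: 0.1533,
}

# The best checkpoint is the step with the minimum validation loss.
best_step = min(val_loss_by_step, key=val_loss_by_step.get)
best_loss = val_loss_by_step[best_step]

# Steps after the best one show rising loss, a sign of overfitting.
later_losses = [loss for step, loss in val_loss_by_step.items() if step > best_step]

print(best_step, best_loss)
print(all(loss > best_loss for loss in later_losses))
```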
Usage
Transcription Pipeline
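from transformers import pipeline
# Build an ASR pipeline on the GPU; use device="cpu" if CUDA is unavailable.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-cv-only-nl",
    device="cuda",
)
# The pipeline returns a dict; the transcript is stored under the "text" key.
result = transcriber("path/to/dutch_audio.wav")
print(result["text"])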
Direct Model Usage
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
# Load the processor (feature extractor and tokenizer) and the fine-tuned model.
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-small-cv-only-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-small-cv-only-nl")
model.to("cuda")
# Whisper expects 16 kHz mono audio; librosa resamples while loading.
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features.to("cuda")
# Generate token IDs, then decode them to text.
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
Specifying Language
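# Applies to the `model` loaded under Direct Model Usage.
# Forces Dutch transcription instead of relying on language auto-detection.
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"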
When to Use This Model
This model is ideal for:
- Baseline comparisons: Evaluating the impact of synthetic data augmentation
- Real-data-only requirements: When synthetic data usage is not permitted
- Minimal training: Fewest training steps (680) among the compared configurations
For better performance, consider the synthetic-augmented variants:
- whisper-small-high-mixed-nl: 0.13 percentage points lower Common Voice Test WER (11.00%), best MLS performance (29.91%)
- whisper-small-mixed-cv-nl: 0.27 percentage points lower Common Voice Test WER (10.86%), best CV performance
Limitations
- No synthetic augmentation: Does not benefit from additional acoustic diversity
- Domain specificity: Trained only on Common Voice; limited generalization to other domains
- Cross-domain performance: Test WER rises from 11.13% on Common Voice to 30.71% on the MLS benchmark
- Dialect coverage: Performance may vary across Dutch regional variants
Citation
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
References
- Base Model: openai/whisper-small
- Training Data: mozilla-foundation/common_voice_17_0
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- IEEE Access Paper: Enhancing ASR with Semantic Audio Filtering
License
Apache 2.0