Whisper-Large-v3 Dutch - Common Voice Only (Baseline)
This model is a fine-tuned version of openai/whisper-large-v3 for Dutch automatic speech recognition (ASR). It was trained exclusively on Common Voice 17.0 Dutch without any synthetic data augmentation, serving as the baseline for evaluating the impact of synthetic speech in ASR training.
Introduction
Purpose
This baseline model demonstrates the performance achievable using only real, crowdsourced speech data from Common Voice 17.0. It serves as a reference point for comparing the effectiveness of synthetic data augmentation approaches, including:
- Quality-filtered synthetic data (WAVe-based filtering)
- Unfiltered synthetic data augmentation
- Different quality thresholds and their impact on ASR performance
Training Approach
The model was fine-tuned from openai/whisper-large-v3 using standard supervised learning on Common Voice 17.0 Dutch:
Real Speech Only: Trained on 34,952 crowdsourced speech samples from Common Voice, with no synthetic augmentation.
Optimization: Trained for 5 epochs with a learning rate of 5e-6, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.
Checkpoint Selection: The best checkpoint was selected based on validation loss, occurring at step 250 with a validation loss of 0.0550.
This baseline achieves strong in-domain performance (4.39% Test WER on Common Voice) but shows limitations in cross-domain generalization (22.43% MLS WER), which synthetic data augmentation helps address.
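The preprocessing code is not included in this card; the sketch below illustrates the standard supervised setup described above, assuming the usual WhisperProcessor feature extraction. Column names (`audio`, `sentence`) follow the Common Voice schema; everything else is illustrative.

```python
from transformers import WhisperProcessor

# Illustrative only: map a Common Voice example to Whisper training features.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v3", language="dutch", task="transcribe"
)

def prepare_example(batch):
    audio = batch["audio"]  # Common Voice stores the waveform under "audio"
    # Log-mel input features from the 16 kHz waveform
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Tokenized transcript used as decoder labels
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch
```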
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-large-v3 |
| Language | Dutch (nl) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 1550M |
| Training Data | Common Voice 17.0 Dutch (Real Speech Only) |
| Total Training Samples | 34,952 |
| Sampling Rate | 16kHz |
Evaluation Results
This Model (whisper-large-v3-cv-only-nl)
| Metric | Value |
|---|---|
| Validation Loss | 0.0549 |
| Validation WER | 3.56% |
| Test WER (Common Voice) | 4.39% |
| Test WER (MLS) | 22.43% |
| Best Checkpoint | Step 250 |
| Max Training Steps | 680 |
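For reference, WER figures like the ones above are typically computed with the `evaluate` library (jiwer under the hood). A minimal sketch, assuming you already have lists of reference and predicted transcripts:

```python
import evaluate

# Minimal WER computation sketch; the transcripts below are placeholders.
wer_metric = evaluate.load("wer")

references = ["dit is een voorbeeldzin"]     # ground-truth transcripts
predictions = ["dit is een voorbeeld zin"]   # model outputs

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {100 * wer:.2f}%")
```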
Comparison with Synthetic Data Augmentation (Whisper-Large-v3 Dutch)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) | MLS Improvement |
|---|---|---|---|---|---|---|
| Common Voice Only (Baseline) | 680 | 0.0549 | 3.56% | 4.39% | 22.43% | — |
| High + Mixed Quality (q ≥ 0.5) | 890 | 0.0520 | 3.57% | 4.43% | 20.29% | +9.5% |
| All Synthetic (Unfiltered) | 1,365 | 0.0560 | 3.61% | 4.44% | 17.02% | +24.1% |
Key Performance Characteristics
- Fastest training: Fewest steps (680) among all configurations
- Smallest dataset: Only 34,952 samples (no synthetic augmentation)
- Strong in-domain: 4.39% Test WER on Common Voice
- Limited cross-domain: 22.43% MLS WER (poorest generalization)
- Reference baseline: Establishes performance without synthetic data
Training Data
Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Dutch | 34,952 | Real crowdsourced speech |
| Synthetic Data | 0 | No synthetic augmentation |
| Total | 34,952 | |
Common Voice 17.0 Dutch
Common Voice is Mozilla's open-source, crowdsourced speech dataset:
- Recording conditions: Varied (home recordings, different microphones, background noise)
- Speaker diversity: Multiple speakers, ages, and accents
- Content: Read sentences from various domains
- Quality: Human-validated transcriptions
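Common Voice 17.0 is gated on the Hugging Face Hub (you must accept the dataset terms and authenticate). A minimal loading sketch that resamples to the 16 kHz expected by Whisper:

```python
from datasets import load_dataset, Audio

# Requires accepting the dataset terms on the Hub and `huggingface-cli login`.
cv_nl = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="train")

# Whisper expects 16 kHz audio; Common Voice ships at a higher sampling rate.
cv_nl = cv_nl.cast_column("audio", Audio(sampling_rate=16_000))

print(cv_nl[0]["sentence"])  # human-validated transcript
```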
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
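A minimal `Seq2SeqTrainingArguments` sketch mirroring the table above. The per-device batch size and gradient accumulation are illustrative placeholders for the global batch size of 256, values not listed in this card are left at their defaults, and argument names can vary slightly across transformers versions (older releases use `evaluation_strategy`).

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative mapping of the hyperparameter table to Trainer arguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-cv-only-nl",
    learning_rate=5e-6,
    num_train_epochs=5,
    per_device_train_batch_size=32,   # placeholder: 32 x 8 grad-accum = 256 global
    gradient_accumulation_steps=8,
    warmup_steps=200,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)
```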
Training Infrastructure
- GPU: NVIDIA H200 (141GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Training Curve
```
Step 100: val_loss = 0.0570
Step 150: val_loss = 0.0554
Step 200: val_loss = 0.0556
Step 250: val_loss = 0.0550 ← Best checkpoint
Step 300: val_loss = 0.0567
Step 400: val_loss = 0.0574
Step 500: val_loss = 0.0613
Step 650: val_loss = 0.0648
```
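To visualize this curve, a quick matplotlib sketch over the values logged above:

```python
import matplotlib.pyplot as plt

# Validation loss values logged above.
steps = [100, 150, 200, 250, 300, 400, 500, 650]
val_loss = [0.0570, 0.0554, 0.0556, 0.0550, 0.0567, 0.0574, 0.0613, 0.0648]

plt.plot(steps, val_loss, marker="o")
plt.axvline(250, linestyle="--", label="best checkpoint (step 250)")
plt.xlabel("training step")
plt.ylabel("validation loss")
plt.legend()
plt.show()
```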
Usage
Transcription Pipeline
```python
from transformers import pipeline

# Load the fine-tuned checkpoint into an ASR pipeline.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-only-nl",
    device="cuda",
)

result = transcriber("path/to/dutch_audio.wav")
print(result["text"])
```
Direct Model Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-cv-only-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-cv-only-nl")
model.to("cuda")

# Load audio at the 16 kHz sampling rate the model expects.
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)

# Convert the waveform to log-mel input features and run generation.
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Specifying Language
```python
# Force Dutch transcription (otherwise Whisper auto-detects the language).
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"
```
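Depending on your transformers version, the language and task can also be passed per call instead of being set on the generation config. Both patterns below are illustrative and reuse the `transcriber`, `model`, and `input_features` objects defined in the examples above:

```python
# With the pipeline: forward decoding options through generate_kwargs.
result = transcriber(
    "path/to/dutch_audio.wav",
    generate_kwargs={"language": "nl", "task": "transcribe"},
)

# With the model directly: pass them to generate().
predicted_ids = model.generate(input_features, language="nl", task="transcribe")
```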
When to Use This Model
This baseline model is ideal when:
- No synthetic data is available: Training on real data only
- Maximum training speed required: Fastest convergence (680 steps)
- In-domain performance is priority: Strong on Common Voice-like data (4.39% WER)
- Comparing augmentation approaches: Reference for measuring synthetic data impact
Consider synthetic-augmented variants for better cross-domain performance:
- whisper-large-v3-high-mixed-nl: 9.5% better on MLS, similar in-domain
- whisper-large-v3-cv-fully-synthetic-nl: 24.1% better on MLS (best generalization)
Impact of Synthetic Data Augmentation
This baseline enables quantifying the value of synthetic speech:
| Metric | CV-Only | + Synthetic (best) | Improvement |
|---|---|---|---|
| Training Steps | 680 | 1,365 | +101% |
| Dataset Size | 34,952 | 69,850 | +100% |
| Test WER (CV) | 4.39% | 4.44% | -0.05pp |
| Test WER (MLS) | 22.43% | 17.02% | +24.1% |
Key insight: Synthetic data augmentation maintains in-domain performance while dramatically improving cross-domain generalization, at the cost of increased training time.
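For completeness, the relative MLS improvement quoted above is computed as (baseline − augmented) / baseline:

```python
# Relative WER improvement on MLS, using the figures from the table above.
baseline_mls_wer = 22.43
augmented_mls_wer = 17.02

relative_improvement = (baseline_mls_wer - augmented_mls_wer) / baseline_mls_wer
print(f"{100 * relative_improvement:.1f}%")  # ~24.1%
```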
Limitations
- Domain specificity: Optimized for Common Voice-style speech; cross-domain performance limited
- Acoustic diversity: Limited to Common Voice recording conditions and speaker pool
- Data scarcity: No augmentation means model capacity may be underutilized
- Generalization: 22.43% MLS WER shows difficulty adapting to different acoustic conditions
Citation
```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```
References
- Base Model: openai/whisper-large-v3
- Training Data: mozilla-foundation/common_voice_17_0
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- IEEE Access Paper: Enhancing ASR with Semantic Audio Filtering
License
Apache 2.0