Whisper-Large-v3 Dutch - Full Synthetic Data (Unfiltered)
This model is a fine-tuned version of openai/whisper-large-v3 for Dutch automatic speech recognition (ASR). It was trained on Common Voice 17.0 Dutch combined with all synthetic speech data without quality filtering, representing the maximum data augmentation approach.
Introduction
Purpose
This model uses all available synthetic data, without WAVe quality filtering, to evaluate the impact of maximum data augmentation on the largest Whisper model. It achieves strong in-domain performance (4.44% WER on the Common Voice test set) but requires significantly more training steps than the filtered configurations, illustrating the quality-vs-quantity tradeoff in synthetic data augmentation.
How the Data Was Created
The training data combines real speech from Common Voice 17.0 with the complete synthetic dataset:
Transcript Generation: We used GPT-4o-mini to generate Dutch transcripts that match the word count distribution observed in Common Voice, ensuring realistic utterance lengths and diverse linguistic content.
Speech Synthesis: Each transcript was converted to audio using OpenAI's TTS-1 model with 9 different voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer), producing 34,898 synthetic samples.
No Quality Filtering: Unlike other models in this series, no WAVe filtering was applied. All 34,898 synthetic samples were used, including those with potential synthesis defects.
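For illustration, the sketch below shows how such a two-stage generation pipeline can be driven with the OpenAI Python SDK. Only the model names (gpt-4o-mini, tts-1) and the voice list come from this card; the prompt wording, word-count targeting, output paths, and voice-cycling strategy are assumptions.
```python
# Minimal sketch of the transcript-generation + TTS pipeline described above.
# Prompt text, file naming, and voice cycling are assumptions, not the exact
# procedure used for yuriyvnv/synthetic_transcript_nl.
import itertools
from openai import OpenAI

client = OpenAI()
VOICES = ["alloy", "ash", "coral", "echo", "fable", "nova", "onyx", "sage", "shimmer"]

def generate_transcript(target_words: int) -> str:
    """Ask GPT-4o-mini for a Dutch sentence of roughly target_words words."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Schrijf één natuurlijke Nederlandse zin van ongeveer {target_words} woorden.",
        }],
    )
    return response.choices[0].message.content.strip()

def synthesize(text: str, voice: str, out_path: str) -> None:
    """Convert a transcript to speech with OpenAI TTS-1 and write it to disk."""
    speech = client.audio.speech.create(model="tts-1", voice=voice, input=text)
    with open(out_path, "wb") as f:
        f.write(speech.content)

# Example: create a few samples, cycling through the voice variants.
for i, voice in zip(range(3), itertools.cycle(VOICES)):
    transcript = generate_transcript(target_words=12)
    synthesize(transcript, voice, f"synthetic_{i:05d}_{voice}.mp3")
```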
How the Model Was Created
The model was fine-tuned from openai/whisper-large-v3 using the Hugging Face Transformers library:
Mixed Training: Combined 34,952 real speech samples from Common Voice 17.0 Dutch with all 34,898 synthetic samples (69,850 total).
Optimization: Trained for 5 epochs with a learning rate of 5e-6, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.
Checkpoint Selection: The best checkpoint was selected based on validation loss, occurring at step 450 with a validation loss of 0.0564.
This approach achieves strong ASR performance but requires roughly twice as many training steps as training on Common Voice only (1,365 vs. 680).
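As a rough sketch, the mixed training set can be assembled with the Hugging Face datasets library as below. The splits and column names (e.g. `sentence`) are assumptions about how the two sources are laid out, and Common Voice 17.0 requires accepting its terms on the Hub.
```python
# Sketch of building the mixed real + synthetic training set (column names assumed).
from datasets import load_dataset, concatenate_datasets, Audio

common_voice = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="train")
synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")

# Whisper expects 16 kHz audio; cast both sources to the same sampling rate.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))
synthetic = synthetic.cast_column("audio", Audio(sampling_rate=16_000))

# Keep only the columns shared by both sources, then concatenate and shuffle.
keep = ["audio", "sentence"]
common_voice = common_voice.select_columns(keep)
synthetic = synthetic.select_columns(keep)

mixed_train = concatenate_datasets([common_voice, synthetic]).shuffle(seed=42)
print(len(mixed_train))  # ~69,850 samples in the full unfiltered configuration
```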
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-large-v3 |
| Language | Dutch (nl) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 1550M |
| Training Data | Common Voice 17.0 + All Synthetic (Unfiltered) |
| Total Training Samples | 69,850 |
| Sampling Rate | 16 kHz |
Evaluation Results
This Model (whisper-large-v3-cv-fully-synthetic-nl)
| Metric | Value |
|---|---|
| Validation Loss | 0.0560 |
| Validation WER | 3.61% |
| Test WER (Common Voice) | 4.44% |
| Test WER (MLS) | 17.02% |
| Best Checkpoint | Step 450 |
| Max Training Steps | 1,365 |
Comparison with Other Training Configurations (Whisper-Large-v3 Dutch)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 680 | 0.0549 | 3.56% | 4.39% | 22.43% |
| High-Quality Filtered + CV | 890 | 0.0520 | 3.57% | 4.43% | 20.29% |
| Mid-High Quality Filtered + CV | 1,270 | 0.0570 | 3.63% | 4.48% | 17.25% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.0560 | 3.61% | 4.44% | 17.02% |
Key Performance Highlights
- Best cross-domain generalization on MLS benchmark (17.02% WER)
- Competitive in-domain performance: 4.44% Test WER on Common Voice, within 0.05 percentage points of the CV-only baseline
- 24.1% relative improvement on MLS vs baseline (17.02% vs 22.43%)
- Tradeoff: Requires 1,365 steps vs 680 for CV-only (100% more compute)
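The WER figures above can be approximated with the `evaluate` library as sketched below. Published numbers may additionally apply text normalization before scoring; the column names and the small subset size are illustrative only.
```python
# Sketch of computing Test WER on the Common Voice 17.0 Dutch test split.
import evaluate
from datasets import load_dataset, Audio
from transformers import pipeline

wer_metric = evaluate.load("wer")
asr = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl",
    device="cuda",
)

test_set = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="test")
test_set = test_set.cast_column("audio", Audio(sampling_rate=16_000))

predictions, references = [], []
for sample in test_set.select(range(100)):  # small subset for illustration
    out = asr(
        {"array": sample["audio"]["array"], "sampling_rate": 16_000},
        generate_kwargs={"language": "nl", "task": "transcribe"},
    )
    predictions.append(out["text"])
    references.append(sample["sentence"])

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")
```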
Training Data
Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Dutch | 34,952 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript NL (all) | 34,898 | Complete TTS audio without filtering |
| Total | 69,850 | |
Synthetic Data Generation Pipeline
The synthetic dataset (yuriyvnv/synthetic_transcript_nl) was generated using:
- Transcript Generation: GPT-4o-mini, matching Common Voice word count distribution
- Speech Synthesis: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
- No Filtering: All samples used regardless of quality
Quality Distribution (For Reference)
While this model uses all data, WAVe quality assessment shows the distribution:
| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 10,555 | 30.2% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 19,627 | 56.2% | ✓ |
| Low (q < 0.5) | 4,716 | 13.5% | ✓ |
| Total | 34,898 | 100% | All used |
Note: 13.5% of the synthetic data (4,716 samples) would be filtered out by WAVe, but is included in this model's training.
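For comparison with the filtered variants, a WAVe-style thresholded subset could be constructed as in the sketch below. This is purely illustrative: no filtering was applied to this model, and the per-sample score column name (`wave_score`) is an assumption.
```python
# Hypothetical sketch of quality-threshold filtering (NOT applied to this model).
# Assumes each synthetic sample carries a per-utterance quality score in a
# column named "wave_score" — that column name is an assumption.
from datasets import load_dataset

synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")

high_quality = synthetic.filter(lambda ex: ex["wave_score"] >= 0.8)   # ~10,555 samples
mid_or_better = synthetic.filter(lambda ex: ex["wave_score"] >= 0.5)  # ~30,182 samples
unfiltered = synthetic                                                # 34,898 samples (this model)
```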
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
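A minimal sketch of how these hyperparameters map onto Hugging Face `Seq2SeqTrainingArguments`, assuming a recent transformers version; the per-device batch size / gradient-accumulation split, save strategy, and output path are assumptions.
```python
# Sketch of the training configuration; values not listed in the table above are assumed.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-cv-fully-synthetic-nl",
    learning_rate=5e-6,
    warmup_steps=200,
    num_train_epochs=5,
    per_device_train_batch_size=64,   # 64 x 4 accumulation steps = 256 global (split assumed)
    gradient_accumulation_steps=4,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)
```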
Training Infrastructure
- GPU: NVIDIA H200 (141 GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Training Curve
```
Step 100:  val_loss = 0.0618
Step 200:  val_loss = 0.0592
Step 300:  val_loss = 0.0582
Step 450:  val_loss = 0.0564 ← Best checkpoint
Step 600:  val_loss = 0.0594
Step 800:  val_loss = 0.0596
Step 1000: val_loss = 0.0641
```
Usage
Transcription Pipeline
```python
from transformers import pipeline

# Load the fine-tuned model as an ASR pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl",
    device="cuda",
)

result = transcriber("path/to/dutch_audio.wav")
print(result["text"])
```
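For recordings longer than Whisper's 30-second context window, the same pipeline can transcribe in chunks; the chunk length and batch size below are illustrative values, not settings taken from this card.
```python
# Chunked transcription for long recordings (parameter values are illustrative).
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl",
    device="cuda",
    chunk_length_s=30,
    batch_size=8,
)
result = transcriber("path/to/long_dutch_audio.wav", return_timestamps=True)
print(result["text"])
```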
Direct Model Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl")
model.to("cuda")

# Whisper expects 16 kHz mono audio
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Specifying Language
```python
# Force Dutch transcription instead of relying on automatic language detection
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"
```
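Alternatively, recent transformers versions accept the language and task directly as `generate()` arguments, which avoids mutating the generation config:
```python
# Per-call language/task selection (supported by recent transformers versions).
predicted_ids = model.generate(input_features, language="nl", task="transcribe")
```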
When to Use This Model
This model is ideal when:
- Best cross-domain performance is required: achieves 17.02% WER on MLS, the lowest among the Large-v3 Dutch configurations in this comparison
- Compute budget is not a constraint: Requires most training steps (1,365)
- Quality filtering is not available: Uses raw synthetic data
Consider filtered alternatives for better efficiency:
- whisper-large-v3-high-mixed-nl: ~35% fewer training steps; lowest Common Voice test WER among the filtered variants (4.43%)
- whisper-large-v3-mixed-cv-nl: ~7% fewer training steps; nearly matching cross-domain performance (17.25% WER on MLS)
Quality vs Quantity Analysis
This model demonstrates the tradeoff between data quantity and quality for Whisper-Large-v3:
| Approach | Synthetic Samples | Training Steps | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|
| CV Only | 0 | 680 | 4.39% | 22.43% |
| High-Quality (q≥0.8) | 10,555 | 890 | 4.43% | 20.29% |
| Mid-High (q≥0.5) | 30,182 | 1,270 | 4.48% | 17.25% |
| Unfiltered (this model) | 34,898 | 1,365 | 4.44% | 17.02% |
Key insight: For Whisper-Large-v3, unfiltered synthetic data provides the best cross-domain generalization (17.02% MLS WER), suggesting that the large model capacity can effectively leverage even lower-quality synthetic samples for improved robustness.
Limitations
- Training efficiency: Requires most compute among all configurations
- Noisy training signal: Includes low-quality synthetic samples (13.5% with q < 0.5)
- Domain specificity: Optimized for general Dutch; may underperform on technical domains
- Dialect coverage: Performance may vary across Dutch regional variants
Citation
This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:
```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```
References
- Base Model: openai/whisper-large-v3
- Training Data (Real): mozilla-foundation/common_voice_17_0
- Training Data (Synthetic): yuriyvnv/synthetic_transcript_nl
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- IEEE Access Paper: Enhancing ASR with Semantic Audio Filtering
License
Apache 2.0