---
license: apache-2.0
language:
- nl
base_model: openai/whisper-large-v3
tags:
- automatic-speech-recognition
- whisper
- dutch
- speech
- audio
- synthetic-data
- asr
- hf-asr-leaderboard
datasets:
- mozilla-foundation/common_voice_17_0
- yuriyvnv/synthetic_transcript_nl
model-index:
- name: whisper-large-v3-high-mixed-nl
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 17.0 (Dutch)
      type: mozilla-foundation/common_voice_17_0
      config: nl
      split: test
    metrics:
    - type: wer
      value: 4.43
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Multilingual LibriSpeech (Dutch)
      type: facebook/multilingual_librispeech
      config: dutch
      split: test
    metrics:
    - type: wer
      value: 20.29
      name: Test WER (MLS)
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
# Whisper-Large-v3 Dutch - High-Quality Filtered Synthetic Data
This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) for Dutch automatic speech recognition (ASR). It was trained on Common Voice 17.0 Dutch combined with **only the high-quality synthetic speech retained by WAVe filtering** at a strict threshold (q ≥ 0.8).
## Introduction
### How the Data Was Created
The training data combines real speech from Common Voice 17.0 with synthetic speech generated through a two-stage pipeline:
1. **Transcript Generation**: We used GPT-4o-mini to generate Dutch transcripts that match the word count distribution observed in Common Voice, ensuring realistic utterance lengths and diverse linguistic content.
2. **Speech Synthesis**: Each transcript was converted to audio using OpenAI's TTS-1 model with 9 different voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer), producing 34,898 synthetic samples.
3. **Quality Filtering with WAVe**: Raw synthetic speech often contains defects such as mispronunciations, omitted words, or prosodic anomalies. To address this, we applied **WAVe (Word-Aligned Verification)**, a model that assesses audio-text alignment at the word level rather than the sentence level. WAVe uses multi-head attention to align each word to its corresponding audio frames and assigns per-word confidence scores via a GLU-based scorer. For this model, only samples scoring above the strict threshold (q ≥ 0.8) were retained, resulting in 10,555 high-quality synthetic samples.
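To illustrate the word-level idea, here is a minimal, hypothetical sketch of turning per-word WAVe scores into an utterance-level quality score and applying the strict threshold. The field names and the mean aggregation are assumptions for illustration, not the released WAVe implementation:
```python
# Hypothetical sketch: per-word scores -> utterance quality q -> strict filter.
def utterance_quality(word_scores: list[float]) -> float:
    # Mean aggregation is an assumption; a min over words would punish
    # single mispronounced words more aggressively.
    return sum(word_scores) / len(word_scores)

sample = {"text": "dit is een voorbeeld", "word_scores": [0.94, 0.91, 0.88, 0.97]}
q = utterance_quality(sample["word_scores"])
keep = q >= 0.8  # the strict threshold used for this model
```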
### How the Model Was Created
The model was fine-tuned from `openai/whisper-large-v3` using the Hugging Face Transformers library with the following approach:
1. **Mixed Training**: Combined 34,952 real speech samples from Common Voice 17.0 Dutch with 10,555 strictly WAVe-filtered high-quality synthetic samples (45,507 total).
2. **Optimization**: Trained for 5 epochs with a learning rate of 5e-6, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.
3. **Checkpoint Selection**: The best checkpoint was selected based on validation loss, occurring at step 350 with a validation loss of 0.0552.
This high-quality filtering approach cuts maximum training steps by roughly **35%** compared to using all synthetic data (890 vs. 1,365 steps), while maintaining excellent ASR performance.
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | openai/whisper-large-v3 |
| **Language** | Dutch (nl) |
| **Task** | Automatic Speech Recognition (transcribe) |
| **Parameters** | 1550M |
| **Training Data** | Common Voice 17.0 + High-Quality Synthetic (q ≥ 0.8) |
| **Total Training Samples** | 45,507 |
| **Sampling Rate** | 16kHz |
## Evaluation Results
### This Model (whisper-large-v3-high-mixed-nl)
| Metric | Value |
|--------|-------|
| **Validation Loss** | 0.0520 |
| **Validation WER** | 3.57% |
| **Test WER (Common Voice)** | 4.43% |
| **Test WER (MLS)** | 20.29% |
| **Best Checkpoint** | Step 350 |
| **Max Training Steps** | 890 |
### Comparison with Other Training Configurations (Whisper-Large-v3 Dutch)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---------------|-----------|----------|---------|---------------|----------------|
| Common Voice Only | 680 | 0.0549 | 3.56% | 4.39% | 22.43% |
| **High-Quality Filtered + CV** | **890** | **0.0520** | **3.57%** | **4.43%** | **20.29%** |
| Mid-High Quality Filtered + CV | 1,270 | 0.0570 | 3.63% | 4.48% | 17.25% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.0560 | 3.61% | 4.44% | 17.02% |
### Key Performance Highlights
- **Most efficient training**: Only 890 max steps (35% fewer than unfiltered)
- **Best validation loss** (0.0520) among all Whisper-Large-v3 Dutch configurations
- **Competitive in-domain performance**: 4.43% Test WER on Common Voice
- **9.5% relative improvement** on MLS benchmark vs baseline (20.29% vs 22.43%)
- **Best quality-to-compute ratio**: strong results using only the top 30.2% of the synthetic pool
## Training Data
### Dataset Composition
| Source | Samples | Description |
|--------|---------|-------------|
| [Common Voice 17.0 Dutch](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 34,952 | Real speech from Mozilla's crowdsourced dataset |
| [Synthetic Transcript NL](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl) (q ≥ 0.8) | 10,555 | Strictly WAVe-filtered TTS audio (high quality only) |
| **Total** | **45,507** | |
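A minimal sketch of how such a mixed training set could be assembled with `datasets`; the quality column name (`q`), the transcript column names, and the split handling are assumptions, not the published training script:
```python
from datasets import Audio, concatenate_datasets, load_dataset

# Common Voice may require accepting the dataset terms on the Hub.
cv = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="train")
synth = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")

# Hypothetical per-sample WAVe quality column; keep only the strict subset.
synth_hq = synth.filter(lambda s: s["q"] >= 0.8)

# Reduce both to a shared (audio, sentence) schema; column names are assumptions.
cv = cv.select_columns(["audio", "sentence"])
synth_hq = synth_hq.select_columns(["audio", "sentence"])

# Whisper expects 16 kHz audio; resample both sources before mixing.
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))
synth_hq = synth_hq.cast_column("audio", Audio(sampling_rate=16_000))

mixed = concatenate_datasets([cv, synth_hq]).shuffle(seed=42)
```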
### Synthetic Data Generation Pipeline
The synthetic dataset ([yuriyvnv/synthetic_transcript_nl](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl)) was generated using:
1. **Transcript Generation**: GPT-4o-mini, matching Common Voice word count distribution
2. **Speech Synthesis**: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
3. **Quality Filtering**: WAVe model with strict threshold q ≥ 0.8 (high quality only)
### WAVe Quality Distribution (Dutch Synthetic Data)
| Quality Level | Samples | Percentage | Used in This Model |
|--------------|---------|------------|-------------------|
| High (q ≥ 0.8) | 10,555 | 30.2% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 19,627 | 56.2% | ✗ |
| Low (q < 0.5) | 4,716 | 13.5% | ✗ |
This strict threshold retains only the top 30.2% of synthetic samples, prioritizing quality over quantity for maximum training efficiency.
## Training Procedure
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
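A sketch of `Seq2SeqTrainingArguments` matching the table above; the per-device batch size / gradient accumulation split is an assumption (only the global batch size of 256 is reported):
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-high-mixed-nl",
    learning_rate=5e-6,
    per_device_train_batch_size=32,   # assumption: 32 x 8 accumulation = 256 global
    gradient_accumulation_steps=8,
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",            # `evaluation_strategy` on older transformers
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)
```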
### Training Infrastructure
- **GPU**: NVIDIA H200 (141GB VRAM)
- **Operating System**: Ubuntu 22.04
- **Framework**: Hugging Face Transformers
### Training Curve
```
Step 100: val_loss = 0.0588
Step 200: val_loss = 0.0562
Step 250: val_loss = 0.0561
Step 350: val_loss = 0.0552 ← Best checkpoint
Step 500: val_loss = 0.0601
Step 650: val_loss = 0.0627
Step 850: val_loss = 0.0680
```
## Usage
### Transcription Pipeline
```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-high-mixed-nl",
    device="cuda",  # use device="cpu" if no GPU is available
)

result = transcriber("path/to/dutch_audio.wav")
print(result["text"])
```
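For recordings longer than 30 seconds, the pipeline's chunked long-form mode is typically needed; a short example (the parameter values are reasonable defaults, not values from the training setup):
```python
result = transcriber(
    "path/to/long_dutch_audio.wav",
    chunk_length_s=30,        # split long audio into 30-second windows
    batch_size=8,             # decode several chunks in parallel
    return_timestamps=True,   # also return segment-level timestamps
)
print(result["text"])
```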
### Direct Model Usage
```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-high-mixed-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-high-mixed-nl")
model.to("cuda")

# Whisper expects 16 kHz mono audio.
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
### Specifying Language
```python
# Force Dutch transcription so Whisper skips language auto-detection.
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"
```
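On recent `transformers` versions, the same options can also be passed per call instead of mutating the generation config:
```python
predicted_ids = model.generate(input_features, language="nl", task="transcribe")
```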
## Methodology
This model leverages **WAVe (Word-Aligned Verification)**, a word-level quality assessment method for filtering synthetic speech data. Unlike sentence-level filtering approaches, WAVe:
- Aligns each word to its corresponding audio frames using multi-head attention
- Assigns per-word confidence scores via a GLU-based scorer
- Detects localized synthesis errors (mispronunciations, omitted words, prosodic anomalies)
- Achieves **6.5% improvement** over sentence-level filtering methods
Applying the strict threshold (q ≥ 0.8) keeps only the top 30.2% of synthetic samples, trading dataset size for a cleaner training signal.
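For intuition only, here is a minimal PyTorch sketch of the word-aligned scoring idea described above: word embeddings cross-attend to audio frames, and a GLU-based head emits per-word confidences. Dimensions, layer choices, and the aggregation are assumptions; this is not the released WAVe code:
```python
import torch
import torch.nn as nn

class WordAlignedScorer(nn.Module):
    """Illustrative sketch of word-level audio-text alignment scoring."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Each word embedding attends over the audio frame sequence.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # GLU gates and halves the feature dimension (dim -> dim // 2).
        self.glu = nn.GLU(dim=-1)
        self.score = nn.Linear(dim // 2, 1)

    def forward(self, word_emb: torch.Tensor, audio_frames: torch.Tensor) -> torch.Tensor:
        # word_emb: (batch, n_words, dim); audio_frames: (batch, n_frames, dim)
        aligned, _ = self.cross_attn(word_emb, audio_frames, audio_frames)
        gated = self.glu(aligned)                             # (batch, n_words, dim // 2)
        return torch.sigmoid(self.score(gated)).squeeze(-1)   # per-word scores in [0, 1]

# An utterance-level quality q can then be an aggregate (e.g., mean) of word scores.
```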
## When to Use This Model
This model is ideal when:
- **Compute resources are limited**: 35% fewer training steps than unfiltered approaches
- **Quick fine-tuning is needed**: Smaller dataset (45,507 samples) enables faster iteration
- **Best validation performance required**: Achieves lowest validation loss (0.0520)
- **Quality over quantity**: Only top-tier synthetic data (30.2%) for clean training signal
Consider other variants based on your needs:
- [whisper-large-v3-mixed-cv-nl](https://huggingface.co/yuriyvnv/whisper-large-v3-mixed-cv-nl): Better cross-domain performance with more data
- [whisper-large-v3-cv-fully-synthetic-nl](https://huggingface.co/yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl): Best cross-domain generalization (17.02% MLS)
## Limitations
- **Domain specificity**: Optimized for general Dutch; may underperform on technical domains
- **Acoustic conditions**: Trained on clean speech; noise robustness not guaranteed
- **Dialect coverage**: Performance may vary across Dutch regional variants
## Citation
```bibtex
@article{perezhohin2024enhancing,
title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
journal={IEEE Access},
year={2024},
publisher={IEEE}
}
```
## References
- **Base Model**: [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
- **Training Data (Real)**: [mozilla-foundation/common_voice_17_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- **Training Data (Synthetic)**: [yuriyvnv/synthetic_transcript_nl](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl)
- **Whisper Paper**: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- **IEEE Access Paper**: [Enhancing ASR with Semantic Audio Filtering](https://ieeexplore.ieee.org/document/10720758)
## License
Apache 2.0