---
license: apache-2.0
language:
- nl
base_model: openai/whisper-large-v3
tags:
- automatic-speech-recognition
- whisper
- dutch
- speech
- audio
- synthetic-data
- asr
- hf-asr-leaderboard
datasets:
- mozilla-foundation/common_voice_17_0
- yuriyvnv/synthetic_transcript_nl
model-index:
- name: whisper-large-v3-high-mixed-nl
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 17.0 (Dutch)
      type: mozilla-foundation/common_voice_17_0
      config: nl
      split: test
    metrics:
    - type: wer
      value: 4.43
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Multilingual LibriSpeech (Dutch)
      type: facebook/multilingual_librispeech
      config: dutch
      split: test
    metrics:
    - type: wer
      value: 20.29
      name: Test WER (MLS)
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# Whisper-Large-v3 Dutch - High-Quality Filtered Synthetic Data

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) for Dutch automatic speech recognition (ASR). It was trained on Common Voice 17.0 Dutch combined with synthetic speech that passed **WAVe quality filtering at a strict threshold (q ≥ 0.8)**; lower-scoring synthetic samples were excluded.

## Introduction

### How the Data Was Created

The training data combines real speech from Common Voice 17.0 with synthetic speech generated through a three-stage pipeline:

1. **Transcript Generation**: We used GPT-4o-mini to generate Dutch transcripts that match the word count distribution observed in Common Voice, ensuring realistic utterance lengths and diverse linguistic content.

2. **Speech Synthesis**: Each transcript was converted to audio using OpenAI's TTS-1 model with 9 different voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer), producing 34,898 synthetic samples.

3. **Quality Filtering with WAVe**: Raw synthetic speech often contains defects such as mispronunciations, omitted words, or prosodic anomalies. To address this, we applied **WAVe (Word-Aligned Verification)**, a model that assesses audio-text alignment at the word level rather than the sentence level. WAVe uses multi-head attention to align each word to its corresponding audio frames and assigns per-word confidence scores via a GLU-based scorer. For this model, only samples scoring at or above the strict threshold (q ≥ 0.8) were retained, resulting in 10,555 high-quality synthetic samples out of 34,898.
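
The thresholding step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `SyntheticSample` record, its field names, and the example scores are hypothetical, and it assumes each sample already carries a single WAVe score `q` in the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass
class SyntheticSample:
    # Hypothetical record; field names are illustrative, not the dataset schema.
    transcript: str
    audio_path: str
    quality: float  # sample-level WAVe score q, assumed to lie in [0, 1]


def filter_high_quality(samples, threshold=0.8):
    """Keep only samples whose score meets the strict threshold (q >= threshold)."""
    return [s for s in samples if s.quality >= threshold]


# Invented example scores, for illustration only.
samples = [
    SyntheticSample("Goedemorgen, hoe gaat het?", "a.wav", 0.91),
    SyntheticSample("De trein vertrekt om negen uur.", "b.wav", 0.80),
    SyntheticSample("Ik woon in Utrecht.", "c.wav", 0.62),
]

kept = filter_high_quality(samples)
print([s.audio_path for s in kept])
```

Because the comparison is inclusive, a sample scoring exactly 0.8 is kept.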

### How the Model Was Created

The model was fine-tuned from `openai/whisper-large-v3` using the Hugging Face Transformers library with the following approach:

1. **Mixed Training**: Combined 34,952 real speech samples from Common Voice 17.0 Dutch with 10,555 strictly WAVe-filtered high-quality synthetic samples (45,507 total).

2. **Optimization**: Trained for 5 epochs with a learning rate of 5e-6, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.

3. **Checkpoint Selection**: The best checkpoint was selected based on validation loss, occurring at step 350 with a validation loss of 0.0552.

This high-quality filtering approach needs **35% fewer training steps** than training on all synthetic data (890 vs. 1,365 max steps), while Common Voice test WER stays essentially unchanged (4.43% vs. 4.44%).

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | openai/whisper-large-v3 |
| **Language** | Dutch (nl) |
| **Task** | Automatic Speech Recognition (transcribe) |
| **Parameters** | 1550M |
| **Training Data** | Common Voice 17.0 + High-Quality Synthetic (q ≥ 0.8) |
| **Total Training Samples** | 45,507 |
| **Sampling Rate** | 16kHz |

## Evaluation Results

### This Model (whisper-large-v3-high-mixed-nl)

| Metric | Value |
|--------|-------|
| **Validation Loss** | 0.0520 |
| **Validation WER** | 3.57% |
| **Test WER (Common Voice)** | 4.43% |
| **Test WER (MLS)** | 20.29% |
| **Best Checkpoint** | Step 350 |
| **Max Training Steps** | 890 |
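
For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) is conventionally computed for a single utterance: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It is a plain-Python illustration, not the evaluation script used for the numbers above. The whitespace tokenization is an assumption, and text normalization (casing, punctuation) is not handled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("zit" -> "zat") and one deletion ("de") over 6 reference words.
wer = word_error_rate("de kat zit op de mat", "de kat zat op mat")
print(f"{wer:.2%}")
```

Corpus-level WER is usually the total number of errors divided by the total number of reference words, rather than the mean of per-utterance values.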

### Comparison with Other Training Configurations (Whisper-Large-v3 Dutch)

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---------------|-----------|----------|---------|---------------|----------------|
| Common Voice Only | 680 | 0.0549 | 3.56% | 4.39% | 22.43% |
| **High-Quality Filtered + CV** | **890** | **0.0520** | **3.57%** | **4.43%** | **20.29%** |
| Mid-High Quality Filtered + CV | 1,270 | 0.0570 | 3.63% | 4.48% | 17.25% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.0560 | 3.61% | 4.44% | 17.02% |

### Key Performance Highlights

- **Most efficient training**: Only 890 max steps (35% fewer than unfiltered)
- **Best validation loss** (0.0520) among all Whisper-Large-v3 Dutch configurations
- **Competitive in-domain performance**: 4.43% Test WER on Common Voice
- **9.5% relative improvement** on MLS benchmark vs baseline (20.29% vs 22.43%)
- **Best quality-to-compute ratio**: Strong results using only the top 30.2% of synthetic samples
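
The two relative figures in this list follow directly from the comparison table. A short check, using only values reported above (the helper name `relative_reduction` is ours):

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# MLS test WER: Common Voice only (22.43%) vs. this model (20.29%).
mls_improvement = relative_reduction(22.43, 20.29)

# Max training steps: unfiltered (1,365) vs. this model (890).
step_reduction = relative_reduction(1365, 890)

print(f"MLS relative WER improvement: {mls_improvement:.1f}%")
print(f"Training step reduction: {step_reduction:.1f}%")
```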

## Training Data

### Dataset Composition

| Source | Samples | Description |
|--------|---------|-------------|
| [Common Voice 17.0 Dutch](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 34,952 | Real speech from Mozilla's crowdsourced dataset |
| [Synthetic Transcript NL](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl) (q ≥ 0.8) | 10,555 | Strictly WAVe-filtered TTS audio (high quality only) |
| **Total** | **45,507** | |

### Synthetic Data Generation Pipeline

The synthetic dataset ([yuriyvnv/synthetic_transcript_nl](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl)) was generated using:

1. **Transcript Generation**: GPT-4o-mini, matching Common Voice word count distribution
2. **Speech Synthesis**: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
3. **Quality Filtering**: WAVe model with strict threshold q ≥ 0.8 (high quality only)

### WAVe Quality Distribution (Dutch Synthetic Data)

| Quality Level | Samples | Percentage | Used in This Model |
|--------------|---------|------------|-------------------|
| High (q ≥ 0.8) | 10,555 | 30.2% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 19,627 | 56.2% | ✗ |
| Low (q < 0.5) | 4,716 | 13.5% | ✗ |

This strict threshold retains only the top 30.2% of synthetic samples, prioritizing quality over quantity for maximum training efficiency.
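
The bucket boundaries and percentages in the table can be reproduced with a few lines of Python. The `quality_level` helper is a hypothetical name for illustration; the counts are the ones reported above.

```python
def quality_level(q: float) -> str:
    """Map a WAVe score to the quality levels used in the table above."""
    if q >= 0.8:
        return "High"
    if q >= 0.5:
        return "Medium"
    return "Low"


# Sample counts per level, as reported in the table.
counts = {"High": 10_555, "Medium": 19_627, "Low": 4_716}
total = sum(counts.values())

shares = {level: round(100 * n / total, 1) for level, n in counts.items()}
print(total, shares)
```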

## Training Procedure

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
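
The maximum step counts reported in the evaluation tables are consistent with these hyperparameters. The sketch below assumes that one epoch takes `ceil(samples / global_batch_size)` optimizer steps, i.e. that the last partial batch is kept. This is an assumption about how steps were counted, not something stated in the training logs.

```python
import math


def max_training_steps(num_samples: int, global_batch_size: int, epochs: int) -> int:
    """Optimizer steps for a full run, assuming the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_samples / global_batch_size)
    return steps_per_epoch * epochs


# This model: 34,952 real + 10,555 filtered synthetic samples.
this_model = max_training_steps(45_507, global_batch_size=256, epochs=5)

# Unfiltered configuration: 34,952 real + all 34,898 synthetic samples.
unfiltered = max_training_steps(34_952 + 34_898, global_batch_size=256, epochs=5)

print(this_model, unfiltered)
```

Under this assumption the 200 warmup steps cover a little over one epoch of the 890-step run.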

### Training Infrastructure

- **GPU**: NVIDIA H200 (141GB VRAM)
- **Operating System**: Ubuntu 22.04
- **Framework**: Hugging Face Transformers

### Training Curve

```
Step  100: val_loss = 0.0588
Step  200: val_loss = 0.0562
Step  250: val_loss = 0.0561
Step  350: val_loss = 0.0552 ← Best checkpoint
Step  500: val_loss = 0.0601
Step  650: val_loss = 0.0627
Step  850: val_loss = 0.0680
```
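
Selecting the best checkpoint by `eval_loss` amounts to taking the step with the lowest logged validation loss. A minimal sketch using the values from the curve above:

```python
# Validation loss per evaluation step, copied from the training curve above.
val_loss_by_step = {
    100: 0.0588,
    200: 0.0562,
    250: 0.0561,
    350: 0.0552,
    500: 0.0601,
    650: 0.0627,
    850: 0.0680,
}

# Lower is better for a loss, so pick the step with the minimum value.
best_step = min(val_loss_by_step, key=val_loss_by_step.get)
print(best_step, val_loss_by_step[best_step])
```

The loss rises steadily after step 350, which is why later checkpoints were not selected.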

## Usage

### Transcription Pipeline

```python
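import torch
from transformers import pipeline

# Use the GPU when one is available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-high-mixed-nl",
    device=device,
)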

result = transcriber("path/to/dutch_audio.wav")
print(result["text"])
```

### Direct Model Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-high-mixed-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-high-mixed-nl")
model.to("cuda")

audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Specifying Language

```python
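# Force Dutch transcription instead of relying on language auto-detection.
# `model` is the WhisperForConditionalGeneration instance loaded above.
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"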
```

## Methodology

This model leverages **WAVe (Word-Aligned Verification)**, a word-level quality assessment method for filtering synthetic speech data. Unlike sentence-level filtering approaches, WAVe:

- Aligns each word to its corresponding audio frames using multi-head attention
- Assigns per-word confidence scores via a GLU-based scorer
- Detects localized synthesis errors (mispronunciations, omitted words, prosodic anomalies)
- Achieves **6.5% improvement** over sentence-level filtering methods

The strict threshold (q ≥ 0.8) retains only the top 30.2% of synthetic samples, prioritizing quality over quantity for maximum training efficiency.
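
The toy example below illustrates why word-level scores can expose errors that a single sentence-level number hides. It is not WAVe itself: the per-word scores are invented, the `min_score` cutoff and the helper name are hypothetical, and the mean is used only as a stand-in for a sentence-level score.

```python
def flag_low_confidence_words(words, scores, min_score=0.5):
    """Return the words whose (hypothetical) confidence falls below min_score."""
    if len(words) != len(scores):
        raise ValueError("words and scores must have the same length")
    return [word for word, score in zip(words, scores) if score < min_score]


# Invented per-word confidences for one synthetic utterance.
words = ["de", "trein", "vertrekt", "om", "negen", "uur"]
scores = [0.97, 0.95, 0.21, 0.96, 0.94, 0.98]

# A single averaged score looks acceptable...
mean_score = sum(scores) / len(scores)

# ...while the word-level view pinpoints the badly synthesized word.
flagged = flag_low_confidence_words(words, scores)

print(f"mean score: {mean_score:.3f}, flagged words: {flagged}")
```

Here the average stays above 0.8 even though one word is clearly misaligned, which is the kind of localized defect that word-level scoring is meant to catch.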

## When to Use This Model

This model is ideal when:
- **Compute resources are limited**: 35% fewer training steps than unfiltered approaches
- **Quick fine-tuning is needed**: Smaller dataset (45,507 samples) enables faster iteration
- **Best validation performance required**: Achieves lowest validation loss (0.0520)
- **Quality over quantity**: Only top-tier synthetic data (30.2%) for clean training signal

Consider other variants based on your needs:
- [whisper-large-v3-mixed-cv-nl](https://huggingface.co/yuriyvnv/whisper-large-v3-mixed-cv-nl): Better cross-domain performance with more data
- [whisper-large-v3-cv-fully-synthetic-nl](https://huggingface.co/yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl): Best cross-domain generalization (17.02% MLS)

## Limitations

- **Domain specificity**: Optimized for general Dutch; may underperform on technical domains
- **Acoustic conditions**: Trained on clean speech; noise robustness not guaranteed
- **Dialect coverage**: Performance may vary across Dutch regional variants

## Citation

```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```

## References

- **Base Model**: [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
- **Training Data (Real)**: [mozilla-foundation/common_voice_17_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- **Training Data (Synthetic)**: [yuriyvnv/synthetic_transcript_nl](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl)
- **Whisper Paper**: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- **IEEE Access Paper**: [Enhancing ASR with Semantic Audio Filtering](https://ieeexplore.ieee.org/document/10720758)

## License

Apache 2.0