Whisper-Tiny Dutch - High-Quality Filtered Synthetic Data

This model is a fine-tuned version of openai/whisper-tiny for Dutch automatic speech recognition (ASR). It was trained on Common Voice 17.0 Dutch combined with WAVe-filtered synthetic speech data using a strict high-quality threshold (q ≥ 0.8).

Introduction

How the Data Was Created

The training data combines real speech from Common Voice 17.0 with synthetic speech generated and filtered through a three-stage pipeline:

  1. Transcript Generation: We used GPT-4o-mini to generate Dutch transcripts that match the word count distribution observed in Common Voice, ensuring realistic utterance lengths and diverse linguistic content.

  2. Speech Synthesis: Each transcript was converted to audio using OpenAI's TTS-1 model with 9 different voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer), producing 34,898 synthetic samples.

  3. Quality Filtering with WAVe: Raw synthetic speech often contains defects such as mispronunciations, omitted words, or prosodic anomalies. To address this, we applied WAVe (Word-Aligned Verification), a model that assesses audio-text alignment at the word level rather than the sentence level. WAVe uses multi-head attention to align each word to its corresponding audio frames and assigns per-word confidence scores via a GLU-based scorer. For this model, only samples scoring above the strict threshold (q ≥ 0.8) were retained, resulting in 10,555 high-quality synthetic samples.
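
As a minimal sketch of the filtering step using the datasets library: the wave_score column name is an assumption, so check the dataset card for the actual field holding the per-sample WAVe quality score q.

from datasets import load_dataset

# "wave_score" is an assumed column name for the per-sample WAVe score q;
# check the dataset card for the actual field.
synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")
high_quality = synthetic.filter(lambda sample: sample["wave_score"] >= 0.8)

print(f"kept {len(high_quality)} of {len(synthetic)} samples")  # expected: 10,555 of 34,898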

How the Model Was Created

The model was fine-tuned from openai/whisper-tiny using the Hugging Face Transformers library with the following approach:

  1. Mixed Training: Combined 34,952 real speech samples from Common Voice 17.0 Dutch with 10,555 strictly WAVe-filtered synthetic samples (45,507 total).

  2. Optimization: Trained for 5 epochs with a learning rate of 5e-5, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.

  3. Checkpoint Selection: The best checkpoint was selected based on validation loss, occurring at step 700 with a validation loss of 0.3323.

This high-quality filtering approach achieves a 35% reduction in training steps compared to using all synthetic data (890 vs. 1,365 max steps), while maintaining competitive ASR performance.

Model Details

| Property | Value |
|---|---|
| Base Model | openai/whisper-tiny |
| Language | Dutch (nl) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 39M |
| Training Data | Common Voice 17.0 + High-Quality Synthetic (q ≥ 0.8) |
| Total Training Samples | 45,507 |
| Sampling Rate | 16 kHz |

Evaluation Results

This Model (whisper-tiny-high-mixed-nl)

| Metric | Value |
|---|---|
| Validation Loss | 0.3323 |
| Validation WER | 19.59% |
| Test WER (Common Voice) | 25.51% |
| Test WER (MLS) | 43.76% |
| Best Checkpoint | Step 700 |
| Max Training Steps | 890 |

Comparison with Other Training Configurations

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 680 | 0.3382 | 19.77% | 26.00% | 44.85% |
| High-Quality Filtered + CV (this model) | 890 | 0.3323 | 19.59% | 25.51% | 43.76% |
| Mid-High Quality Filtered + CV | 1,270 | 0.3292 | 19.36% | 25.05% | 43.11% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.3207 | 19.61% | 24.93% | 43.12% |
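
The WER figures above can be recomputed with the evaluate library. This is a minimal sketch with placeholder transcript pairs rather than the full Common Voice or MLS test sets:

import evaluate

wer_metric = evaluate.load("wer")

# Placeholder pairs; in practice, decode the test split with the model
# and compare against the reference transcripts.
references = ["dit is een voorbeeldzin", "de kat zit op de mat"]
predictions = ["dit is een voorbeeld zin", "de kat zit op de mat"]

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2%}")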

Key Performance Highlights

  • Most efficient training: Only 890 max steps (35% fewer than unfiltered)
  • 1.9% relative improvement on Common Voice test set vs baseline (25.51% vs 26.00%)
  • 2.4% relative improvement on MLS benchmark vs baseline (43.76% vs 44.85%)
  • Best quality-to-compute ratio: Achieves strong results with minimal synthetic data

Training Data

Dataset Composition

| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Dutch | 34,952 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript NL (q ≥ 0.8) | 10,555 | Strictly WAVe-filtered TTS audio |
| **Total** | **45,507** | |
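
A minimal sketch of assembling this mix with the datasets library, assuming the synthetic set exposes wave_score and sentence columns (both assumptions; Common Voice uses sentence for its transcripts):

from datasets import Audio, concatenate_datasets, load_dataset

# Common Voice 17.0 Dutch is gated on the Hub; accept its terms first.
real = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="train")

# "wave_score" and "sentence" column names for the synthetic set are assumptions.
synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")
synthetic = synthetic.filter(lambda s: s["wave_score"] >= 0.8)

# Resample both sources to Whisper's 16 kHz input and keep matching columns.
real = real.cast_column("audio", Audio(sampling_rate=16_000)).select_columns(["audio", "sentence"])
synthetic = synthetic.cast_column("audio", Audio(sampling_rate=16_000)).select_columns(["audio", "sentence"])

mixed = concatenate_datasets([real, synthetic])  # 34,952 + 10,555 = 45,507 samples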

Synthetic Data Generation Pipeline

The synthetic dataset (yuriyvnv/synthetic_transcript_nl) was generated using:

  1. Transcript Generation: GPT-4o-mini, matching Common Voice word count distribution
  2. Speech Synthesis: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
  3. Quality Filtering: WAVe model with strict threshold q ≥ 0.8

WAVe Quality Distribution (Dutch Synthetic Data)

| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 10,555 | 30.2% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 19,627 | 56.2% | ✗ |
| Low (q < 0.5) | 4,716 | 13.5% | ✗ |
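
The tier boundaries above correspond to a simple bucketing rule over the WAVe score; a minimal sketch:

def quality_tier(q: float) -> str:
    # Boundaries taken from the distribution table above.
    if q >= 0.8:
        return "high"    # the only tier used to train this model
    if q >= 0.5:
        return "medium"
    return "low"

print(quality_tier(0.83))  # "high"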

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 5e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
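
A Seq2SeqTrainingArguments configuration consistent with this table might look as follows. The per-device batch size and gradient accumulation split is an assumption; only the global product of 256 is documented.

from transformers import Seq2SeqTrainingArguments

# Sketch of a configuration consistent with the table above. The
# 64 x 4 = 256 per-device/accumulation split is an assumption.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-high-mixed-nl",
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=4,
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)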

Training Infrastructure

  • GPU: NVIDIA H200 (141 GB VRAM)
  • Operating System: Ubuntu 22.04
  • Framework: Hugging Face Transformers

Training Curve

Step  100: val_loss = 0.4770
Step  250: val_loss = 0.3746
Step  400: val_loss = 0.3457
Step  550: val_loss = 0.3341
Step  700: val_loss = 0.3323 ← Best checkpoint
Step  850: val_loss = 0.3358

Usage

Transcription Pipeline

from transformers import pipeline

# Build an ASR pipeline for this checkpoint; use device="cpu" if no GPU is available.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-tiny-high-mixed-nl",
    device="cuda"
)

result = transcriber("path/to/dutch_audio.wav")
print(result["text"])
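
Whisper processes audio in 30-second windows; for longer recordings the same pipeline can transcribe in chunks:

# Chunked long-form transcription with segment timestamps.
result = transcriber(
    "path/to/long_dutch_audio.wav",
    chunk_length_s=30,
    return_timestamps=True,
)
print(result["text"])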

Direct Model Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-tiny-high-mixed-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-tiny-high-mixed-nl")
model.to("cuda")

# Whisper expects 16 kHz mono input; librosa resamples on load.
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

# Greedy decoding, then strip special tokens from the output.
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Specifying Language

# Pin the decoder to Dutch transcription so automatic language detection is skipped.
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"
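
When using the high-level pipeline instead, the same constraint can be passed per call via generate_kwargs:

result = transcriber(
    "path/to/dutch_audio.wav",
    generate_kwargs={"language": "nl", "task": "transcribe"},
)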

Methodology

This model leverages WAVe (Word-Aligned Verification), a word-level quality assessment method for filtering synthetic speech data. Unlike sentence-level filtering approaches, WAVe:

  • Aligns each word to its corresponding audio frames using multi-head attention
  • Assigns per-word confidence scores via a GLU-based scorer
  • Detects localized synthesis errors (mispronunciations, omitted words, prosodic anomalies)
  • Achieves a 6.5% improvement over sentence-level filtering methods

The strict threshold (q ≥ 0.8) retains only the top 30.2% of synthetic samples, prioritizing quality over quantity for maximum training efficiency.
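
The WAVe scorer itself is not part of this repository, but the thresholding it enables is easy to illustrate. A minimal sketch, assuming the utterance quality q is the mean of per-word confidences (the actual aggregation in WAVe may differ):

def utterance_quality(word_scores: list[float]) -> float:
    # Mean aggregation is an assumption for illustration only; the
    # aggregation used by the real WAVe model may differ.
    return sum(word_scores) / len(word_scores)

# A single badly synthesized word drags the utterance below the strict cutoff.
scores = [0.95, 0.91, 0.32, 0.94]  # hypothetical per-word confidences
q = utterance_quality(scores)
print(f"q = {q:.2f}, keep = {q >= 0.8}")  # q = 0.78, keep = False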

When to Use This Model

This model is ideal when:

  • Compute resources are limited: 35% fewer training steps than unfiltered approaches
  • Quick fine-tuning is needed: Smaller dataset enables faster iteration
  • Baseline improvement is sufficient: a 1.9% relative WER improvement over CV-only training

Consider the mid-high quality filtered model if you need better absolute performance and have more compute budget.

Limitations

  • Model capacity: Whisper-Tiny (39M params) has limited representational power
  • Domain specificity: Optimized for general Dutch; may underperform on technical domains
  • Acoustic conditions: Trained on clean speech; noise robustness not guaranteed
  • Dialect coverage: Performance may vary across Dutch regional variants

Citation

@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}

License

Apache 2.0
