Whisper-Large-v3 Dutch - Common Voice Only (Baseline)

This model is a fine-tuned version of openai/whisper-large-v3 for Dutch automatic speech recognition (ASR). It was trained exclusively on Common Voice 17.0 Dutch without any synthetic data augmentation, serving as the baseline for evaluating the impact of synthetic speech in ASR training.

Introduction

Purpose

This baseline model demonstrates the performance achievable using only real, crowdsourced speech data from Common Voice 17.0. It serves as a reference point for comparing the effectiveness of synthetic data augmentation approaches, including:

  • Quality-filtered synthetic data (WAVe-based filtering)
  • Unfiltered synthetic data augmentation
  • Different quality thresholds and their impact on ASR performance

Training Approach

The model was fine-tuned from openai/whisper-large-v3 using standard supervised learning on Common Voice 17.0 Dutch:

  1. Real Speech Only: Trained on 34,952 crowdsourced speech samples from Common Voice, with no synthetic augmentation.

  2. Optimization: Trained for 5 epochs with a learning rate of 5e-6, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.

  3. Checkpoint Selection: The best checkpoint was selected based on validation loss, occurring at step 250 with a validation loss of 0.0550.

This baseline achieves strong in-domain performance (4.39% Test WER on Common Voice) but shows limitations in cross-domain generalization (22.43% MLS WER), which synthetic data augmentation helps address.
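As a rough sanity check, the step counts reported here can be related to the dataset size and batch size. This is a back-of-the-envelope sketch, not the authors' calculation: the exact step count depends on details this card does not state (gradient accumulation, handling of the last partial batch), so the estimate only approximates the reported 680 steps.

```python
# Figures taken from this model card.
num_samples = 34_952
global_batch_size = 256
num_epochs = 5
best_step = 250

# Approximate optimizer steps per epoch (partial-batch handling is an assumption).
steps_per_epoch = num_samples / global_batch_size
estimated_total_steps = steps_per_epoch * num_epochs

# How far into training the best checkpoint was reached, measured in epochs.
epochs_at_best = best_step * global_batch_size / num_samples

print(f"steps per epoch ~ {steps_per_epoch:.1f}")
print(f"estimated total steps ~ {estimated_total_steps:.0f}")
print(f"best checkpoint at ~ {epochs_at_best:.2f} epochs")
```

Under these assumptions the best checkpoint falls a little under two epochs into the five-epoch run.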

Model Details

| Property | Value |
|---|---|
| Base Model | openai/whisper-large-v3 |
| Language | Dutch (nl) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 1550M |
| Training Data | Common Voice 17.0 Dutch (Real Speech Only) |
| Total Training Samples | 34,952 |
| Sampling Rate | 16 kHz |

Evaluation Results

This Model (whisper-large-v3-cv-only-nl)

| Metric | Value |
|---|---|
| Validation Loss | 0.0549 |
| Validation WER | 3.56% |
| Test WER (Common Voice) | 4.39% |
| Test WER (MLS) | 22.43% |
| Best Checkpoint | Step 250 |
| Max Training Steps | 680 |
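For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The function below is a minimal sketch of that definition. It is not the evaluation script behind the numbers above, and the text normalization used for those numbers (casing, punctuation) is not stated in this card. The function name and the example sentence are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative Dutch example: the second "de" is missing from the hypothesis,
# so there is one deletion against six reference words.
print(word_error_rate("de kat zit op de mat", "de kat zit op mat"))
```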

Comparison with Synthetic Data Augmentation (Whisper-Large-v3 Dutch)

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) | MLS Improvement |
|---|---|---|---|---|---|---|
| Common Voice Only (Baseline) | 680 | 0.0549 | 3.56% | 4.39% | 22.43% | - |
| High + Mixed Quality (q ≥ 0.5) | 890 | 0.0520 | 3.57% | 4.43% | 20.29% | +9.5% |
| All Synthetic (Unfiltered) | 1,365 | 0.0560 | 3.61% | 4.44% | 17.02% | +24.1% |
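The MLS Improvement figures appear to be relative WER reductions against the baseline rather than percentage-point differences. A short sketch that reproduces them from the WER values in the table (the function name is illustrative):

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent (positive means the new model is better)."""
    return (baseline_wer - new_wer) / baseline_wer * 100


baseline_mls = 22.43  # Common Voice Only (Baseline), Test WER on MLS

# High + Mixed Quality (q >= 0.5), then All Synthetic (Unfiltered)
print(round(relative_wer_reduction(baseline_mls, 20.29), 1))
print(round(relative_wer_reduction(baseline_mls, 17.02), 1))
```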

Key Performance Characteristics

  • Fastest training: Fewest steps (680) among all configurations
  • Smallest dataset: Only 34,952 samples (no synthetic augmentation)
  • Strong in-domain: 4.39% Test WER on Common Voice
  • Limited cross-domain: 22.43% MLS WER (highest of the three compared configurations)
  • Reference baseline: Establishes performance without synthetic data

Training Data

Dataset Composition

| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Dutch | 34,952 | Real crowdsourced speech |
| Synthetic Data | 0 | No synthetic augmentation |
| Total | 34,952 | |

Common Voice 17.0 Dutch

Common Voice is Mozilla's open-source, crowdsourced speech dataset:

  • Recording conditions: Varied (home recordings, different microphones, background noise)
  • Speaker diversity: Multiple speakers, ages, and accents
  • Content: Read sentences from various domains
  • Quality: Human-validated transcriptions

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
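One way these hyperparameters might map onto Hugging Face `Seq2SeqTrainingArguments` is sketched below as a plain dictionary of keyword arguments. This is not the authors' training script. The output path, the split of the global batch size into per-device batch size and gradient accumulation, and the save settings are assumptions made for illustration.

```python
# Hyperparameters from the table above, written as keyword arguments for
# transformers.Seq2SeqTrainingArguments. Values marked "assumed" are not in the card.
training_kwargs = {
    "output_dir": "./whisper-large-v3-cv-only-nl",  # hypothetical path
    "learning_rate": 5e-6,
    "warmup_steps": 200,
    "num_train_epochs": 5,
    "bf16": True,
    "optim": "adamw_torch_fused",
    # Assumed split of the global batch size of 256 on a single GPU.
    "per_device_train_batch_size": 32,
    "gradient_accumulation_steps": 8,
    "eval_strategy": "steps",  # named "evaluation_strategy" in older releases
    "eval_steps": 50,
    "save_strategy": "steps",  # assumed
    "save_steps": 50,  # assumed; aligned with eval_steps so the best checkpoint is saved
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,  # lower validation loss is better
    "load_best_model_at_end": True,
}

# Check that the assumed split reproduces the reported global batch size.
global_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(global_batch_size)
```

With Transformers installed on BF16-capable hardware, the dictionary could be passed as `Seq2SeqTrainingArguments(**training_kwargs)`.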

Training Infrastructure

  • GPU: NVIDIA H200 (141GB VRAM)
  • Operating System: Ubuntu 22.04
  • Framework: Hugging Face Transformers

Training Curve

Step  100: val_loss = 0.0570
Step  150: val_loss = 0.0554
Step  200: val_loss = 0.0556
Step  250: val_loss = 0.0550 ← Best checkpoint
Step  300: val_loss = 0.0567
Step  400: val_loss = 0.0574
Step  500: val_loss = 0.0613
Step  650: val_loss = 0.0648
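Checkpoint selection by `eval_loss` amounts to taking the minimum over the logged validation losses. A minimal sketch using only the values listed in the curve above (intermediate evaluation steps are not shown in this card, so they are not included):

```python
# Validation losses copied from the training curve above (step -> val_loss).
val_loss_by_step = {
    100: 0.0570,
    150: 0.0554,
    200: 0.0556,
    250: 0.0550,
    300: 0.0567,
    400: 0.0574,
    500: 0.0613,
    650: 0.0648,
}

# The best checkpoint is the step with the lowest validation loss.
best_step = min(val_loss_by_step, key=val_loss_by_step.get)
best_loss = val_loss_by_step[best_step]

# Every later logged step has a higher loss, which points to overfitting after the best step.
later_losses = [loss for step, loss in sorted(val_loss_by_step.items()) if step > best_step]
overfits_after_best = all(loss > best_loss for loss in later_losses)

print(best_step, best_loss, overfits_after_best)
```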

Usage

Transcription Pipeline

from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-only-nl",
    device="cuda"
)

result = transcriber("path/to/dutch_audio.wav")
print(result["text"])

Direct Model Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-cv-only-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-cv-only-nl")
model.to("cuda")

# Whisper expects 16 kHz mono audio; librosa resamples on load
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
# Convert the waveform to log-mel input features and move them to the GPU
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

# Generate token IDs, then decode them to text without special tokens
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Specifying Language

model.generation_config.language = "nl"
model.generation_config.task = "transcribe"
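The same settings can also be passed per call instead of being stored on the generation config. The sketch below assumes a recent Transformers release in which the ASR pipeline forwards `generate_kwargs` to Whisper's `generate` and accepts a language name such as "dutch". The helper name `transcribe_dutch` is illustrative.

```python
# Decoding options that pin Whisper to Dutch transcription.
GENERATE_KWARGS = {"language": "dutch", "task": "transcribe"}


def transcribe_dutch(audio_path: str, device: str = "cuda") -> str:
    """Transcribe one audio file, forcing Dutch instead of relying on language detection."""
    # Imported inside the function so the heavy dependency is loaded only when needed.
    from transformers import pipeline

    transcriber = pipeline(
        "automatic-speech-recognition",
        model="yuriyvnv/whisper-large-v3-cv-only-nl",
        device=device,
    )
    result = transcriber(audio_path, generate_kwargs=GENERATE_KWARGS)
    return result["text"]
```

Calling `transcribe_dutch("path/to/dutch_audio.wav")` would download the model on first use. For repeated calls it is better to build the pipeline once and reuse it.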

When to Use This Model

This baseline model is ideal when:

  • No synthetic data is available: Training on real data only
  • Maximum training speed required: Shortest training run of the compared configurations (680 steps)
  • In-domain performance is priority: Strong on Common Voice-like data (4.39% WER)
  • Comparing augmentation approaches: Reference for measuring synthetic data impact

For better cross-domain performance, consider the synthetic-augmented variants listed in the comparison table above.

Impact of Synthetic Data Augmentation

This baseline enables quantifying the value of synthetic speech:

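| Metric | CV-Only | + Synthetic (best) | Change |
|---|---|---|---|
| Training Steps | 680 | 1,365 | +101% |
| Dataset Size (samples) | 34,952 | 69,850 | +100% |
| Test WER (CV) | 4.39% | 4.44% | +0.05 pp (slightly worse) |
| Test WER (MLS) | 22.43% | 17.02% | -24.1% relative (better) |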

Key insight: In this comparison, synthetic data augmentation leaves in-domain performance essentially unchanged (4.39% vs. 4.44% Test WER on Common Voice) while cutting MLS WER from 22.43% to 17.02%, at the cost of roughly twice as many training steps.

Limitations

  • Domain specificity: Optimized for Common Voice-style speech; cross-domain performance limited
  • Acoustic diversity: Limited to Common Voice recording conditions and speaker pool
  • Data scarcity: No augmentation means model capacity may be underutilized
  • Generalization: 22.43% MLS WER shows difficulty adapting to different acoustic conditions

Citation

@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}

License

Apache 2.0
