Whisper-Large-v3 Dutch - Full Synthetic Data (Unfiltered)

This model is a fine-tuned version of openai/whisper-large-v3 for Dutch automatic speech recognition (ASR). It was trained on Common Voice 17.0 Dutch combined with all synthetic speech data without quality filtering, representing the maximum data augmentation approach.

Introduction

Purpose

This model uses all available synthetic data without WAVe quality filtering to evaluate the impact of maximum data augmentation for the largest Whisper model. It achieves strong in-domain performance (4.44% test WER on Common Voice) and the best cross-domain generalization in this series (17.02% WER on MLS), but requires significantly more training steps than the filtered approaches, demonstrating the quality-vs-quantity tradeoff in synthetic data augmentation.

How the Data Was Created

The training data combines real speech from Common Voice 17.0 with the complete synthetic dataset:

  1. Transcript Generation: We used GPT-4o-mini to generate Dutch transcripts that match the word count distribution observed in Common Voice, ensuring realistic utterance lengths and diverse linguistic content.

  2. Speech Synthesis: Each transcript was converted to audio using OpenAI's TTS-1 model with 9 different voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer), producing 34,898 synthetic samples.

  3. No Quality Filtering: Unlike other models in this series, no WAVe filtering was applied. All 34,898 synthetic samples were used, including those with potential synthesis defects.
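
For illustration, here is a minimal sketch of this two-step generation pipeline using the openai Python client. The prompt wording, voice selection, and output path are assumptions for illustration, not details from the original setup:

from openai import OpenAI

client = OpenAI()

# Step 1: generate a Dutch transcript with GPT-4o-mini (illustrative prompt)
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Schrijf een natuurlijke Nederlandse zin van ongeveer tien woorden."}],
)
transcript = chat.choices[0].message.content

# Step 2: synthesize speech with TTS-1, drawing from the nine voice variants
voices = ["alloy", "ash", "coral", "echo", "fable", "nova", "onyx", "sage", "shimmer"]
speech = client.audio.speech.create(model="tts-1", voice=voices[0], input=transcript)
speech.write_to_file("synthetic_sample.mp3")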

How the Model Was Created

The model was fine-tuned from openai/whisper-large-v3 using the Hugging Face Transformers library:

  1. Mixed Training: Combined 34,952 real speech samples from Common Voice 17.0 Dutch with all 34,898 synthetic samples (69,850 total).

  2. Optimization: Trained for 5 epochs with a learning rate of 5e-6, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.

  3. Checkpoint Selection: The best checkpoint was selected based on validation loss, occurring at step 450 with a validation loss of 0.0564.

This approach achieves strong ASR performance but requires roughly twice as many training steps (1,365 vs 680) as training on Common Voice alone.
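
A minimal sketch of the dataset-mixing step with the Hugging Face datasets library follows. Split handling and column names are assumptions; only the dataset IDs come from this card:

from datasets import load_dataset, concatenate_datasets, Audio

# Real speech: Common Voice 17.0 Dutch (gated dataset; requires accepting its terms)
real = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="train")

# Synthetic speech: the complete, unfiltered TTS dataset referenced in this card
synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")

# Align schemas before concatenating (column names are assumed; adjust to the real ones)
real = real.select_columns(["audio", "sentence"]).cast_column("audio", Audio(sampling_rate=16000))
synthetic = synthetic.select_columns(["audio", "sentence"]).cast_column("audio", Audio(sampling_rate=16000))

# 34,952 real + 34,898 synthetic = 69,850 mixed training samples
mixed = concatenate_datasets([real, synthetic]).shuffle(seed=42)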

Model Details

| Property | Value |
|---|---|
| Base Model | openai/whisper-large-v3 |
| Language | Dutch (nl) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 1550M |
| Training Data | Common Voice 17.0 + All Synthetic (Unfiltered) |
| Total Training Samples | 69,850 |
| Sampling Rate | 16 kHz |

Evaluation Results

This Model (whisper-large-v3-cv-fully-synthetic-nl)

| Metric | Value |
|---|---|
| Validation Loss | 0.0560 |
| Validation WER | 3.61% |
| Test WER (Common Voice) | 4.44% |
| Test WER (MLS) | 17.02% |
| Best Checkpoint | Step 450 |
| Max Training Steps | 1,365 |

Comparison with Other Training Configurations (Whisper-Large-v3 Dutch)

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 680 | 0.0549 | 3.56% | 4.39% | 22.43% |
| High-Quality Filtered + CV | 890 | 0.0520 | 3.57% | 4.43% | 20.29% |
| Mid-High Quality Filtered + CV | 1,270 | 0.0570 | 3.63% | 4.48% | 17.25% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.0560 | 3.61% | 4.44% | 17.02% |

Key Performance Highlights

  • Best cross-domain generalization on the MLS benchmark among Large-v3 Dutch configurations (17.02% WER)
  • Competitive in-domain performance: 4.44% test WER on Common Voice (within 0.05 percentage points of the baseline)
  • 24.1% relative improvement on MLS over the baseline (17.02% vs 22.43%)
  • Tradeoff: requires 1,365 training steps vs 680 for CV-only (roughly 2× the compute)
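
A minimal sketch of reproducing the Common Voice test WER with the evaluate library is shown below; text normalization is omitted, so the exact number may differ:

import evaluate
from datasets import load_dataset
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl",
    device="cuda",
)
test_set = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="test")

# Transcribe each test clip and compare against the reference sentence
predictions = [transcriber(ex["audio"])["text"] for ex in test_set]
references = [ex["sentence"] for ex in test_set]

wer = evaluate.load("wer")
print(f"WER: {100 * wer.compute(predictions=predictions, references=references):.2f}%")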

Training Data

Dataset Composition

| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Dutch | 34,952 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript NL (all) | 34,898 | Complete TTS audio without filtering |
| Total | 69,850 | |

Synthetic Data Generation Pipeline

The synthetic dataset (yuriyvnv/synthetic_transcript_nl) was generated using:

  1. Transcript Generation: GPT-4o-mini, matching Common Voice word count distribution
  2. Speech Synthesis: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
  3. No Filtering: All samples used regardless of quality

Quality Distribution (For Reference)

While this model uses all data, WAVe quality assessment shows the distribution:

| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 10,555 | 30.2% | Yes |
| Medium (0.5 ≤ q < 0.8) | 19,627 | 56.2% | Yes |
| Low (q < 0.5) | 4,716 | 13.5% | Yes |
| Total | 34,898 | 100% | All used |

Note: 13.5% of the synthetic data (4,716 samples) would be filtered out by WAVe, but is included in this model's training.
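
For contrast, a hypothetical sketch of the WAVe-style filtering that this model deliberately skips (the wave_score column is an assumption for illustration; WAVe scores are not shipped with the dataset):

from datasets import load_dataset

synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")

# Hypothetical quality filter: keep samples at or above a WAVe score threshold
high_quality = synthetic.filter(lambda ex: ex["wave_score"] >= 0.8)  # ~10,555 samples
mid_high = synthetic.filter(lambda ex: ex["wave_score"] >= 0.5)      # ~30,182 samples

# This model applies no filter: all 34,898 samples are used, including the
# 4,716 low-quality samples (q < 0.5) that WAVe would discard.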

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
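
A minimal sketch of these hyperparameters expressed as Seq2SeqTrainingArguments (the per-device batch size and gradient accumulation are assumptions that multiply to the global batch size of 256):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-nl",
    learning_rate=5e-6,
    per_device_train_batch_size=32,   # assumption: 32 × 8 accumulation = 256 global
    gradient_accumulation_steps=8,
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)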

Training Infrastructure

  • GPU: NVIDIA H200 (141GB VRAM)
  • Operating System: Ubuntu 22.04
  • Framework: Hugging Face Transformers

Training Curve

Step  100: val_loss = 0.0618
Step  200: val_loss = 0.0592
Step  300: val_loss = 0.0582
Step  450: val_loss = 0.0564 ← Best checkpoint
Step  600: val_loss = 0.0594
Step  800: val_loss = 0.0596
Step 1000: val_loss = 0.0641

Usage

Transcription Pipeline

from transformers import pipeline

# Load the fine-tuned model as an ASR pipeline on GPU
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl",
    device="cuda"
)

result = transcriber("path/to/dutch_audio.wav")
print(result["text"])
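
For recordings longer than Whisper's 30-second window, the pipeline can transcribe in chunks. This is a minimal sketch; the chunk length is a common choice, not something this card prescribes:

from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl",
    device="cuda",
    chunk_length_s=30,  # split long audio into 30 s windows
)

result = transcriber("path/to/long_dutch_audio.wav", return_timestamps=True)
print(result["text"])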

Direct Model Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the processor (feature extractor + tokenizer) and the model weights
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl")
model.to("cuda")

# Load the audio, resampled to the model's expected 16 kHz
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

# Generate token IDs and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Specifying Language

# Force Dutch transcription instead of relying on language auto-detection
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"
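
Alternatively, the language and task can be passed per call to generate (a minimal sketch; these keyword arguments are supported for Whisper models in recent Transformers releases):

# Equivalent per-call override, leaving generation_config untouched
predicted_ids = model.generate(input_features, language="nl", task="transcribe")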

When to Use This Model

This model is ideal when:

  • Best cross-domain performance is required: Achieves 17.02% WER on MLS (best among Large-v3 Dutch)
  • Compute budget is not a constraint: Requires most training steps (1,365)
  • Quality filtering is not available: Uses raw synthetic data

Consider the filtered alternatives for better training efficiency; the analysis below quantifies the tradeoff.

Quality vs Quantity Analysis

This model demonstrates the tradeoff between data quantity and quality for Whisper-Large-v3:

| Approach | Synthetic Samples | Training Steps | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|
| CV Only | 0 | 680 | 4.39% | 22.43% |
| High-Quality (q ≥ 0.8) | 10,555 | 890 | 4.43% | 20.29% |
| Mid-High (q ≥ 0.5) | 30,182 | 1,270 | 4.48% | 17.25% |
| Unfiltered (this model) | 34,898 | 1,365 | 4.44% | 17.02% |

Key insight: For Whisper-Large-v3, unfiltered synthetic data provides the best cross-domain generalization (17.02% MLS WER), suggesting that the large model capacity can effectively leverage even lower-quality synthetic samples for improved robustness.

Limitations

  • Training efficiency: Requires the most compute of all configurations compared above
  • Noisy training signal: Includes low-quality synthetic samples (13.5% with q < 0.5)
  • Domain specificity: Optimized for general Dutch; may underperform on technical domains
  • Dialect coverage: Performance may vary across Dutch regional variants

Citation

This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:

@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}

License

Apache 2.0
