---
license: apache-2.0
language:
- nl
base_model: openai/whisper-small
tags:
- automatic-speech-recognition
- whisper
- dutch
- speech
- audio
- asr
- hf-asr-leaderboard
datasets:
- mozilla-foundation/common_voice_17_0
model-index:
- name: whisper-small-cv-only-nl
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 17.0 (Dutch)
      type: mozilla-foundation/common_voice_17_0
      config: nl
      split: test
    metrics:
    - type: wer
      value: 11.13
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Multilingual LibriSpeech (Dutch)
      type: facebook/multilingual_librispeech
      config: dutch
      split: test
    metrics:
    - type: wer
      value: 30.71
      name: Test WER (MLS)
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# Whisper-Small Dutch - Common Voice Only (Baseline)

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) for Dutch automatic speech recognition (ASR). It was trained exclusively on **Common Voice 17.0 Dutch** without any synthetic data augmentation, serving as a baseline for comparison with synthetic-augmented models.

## Introduction

### Purpose

This model serves as the **baseline** for evaluating the effectiveness of synthetic data augmentation in Dutch ASR. By training only on real speech data from Common Voice 17.0, we establish reference performance metrics against which synthetic-augmented models can be compared.

### How the Model Was Created

The model was fine-tuned from `openai/whisper-small` using the Hugging Face Transformers library:

1. **Training Data**: 34,952 real speech samples from Common Voice 17.0 Dutch (train split).
2. **Optimization**: Trained for 5 epochs with a learning rate of 1e-5, a global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.
3. **Checkpoint Selection**: The best checkpoint was selected by validation loss; it occurred at step 400 with a validation loss of 0.1492.

This baseline achieves **11.13% WER** on the Common Voice test set. The synthetic-augmented models reduce that figure by up to 2.4% relative.

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | openai/whisper-small |
| **Language** | Dutch (nl) |
| **Task** | Automatic Speech Recognition (transcribe) |
| **Parameters** | 244M |
| **Training Data** | Common Voice 17.0 Dutch only |
| **Total Training Samples** | 34,952 |
| **Sampling Rate** | 16 kHz |

## Evaluation Results

### This Model (whisper-small-cv-only-nl)

| Metric | Value |
|--------|-------|
| **Validation Loss** | 0.1491 |
| **Validation WER** | 8.73% |
| **Test WER (Common Voice)** | 11.13% |
| **Test WER (MLS)** | 30.71% |
| **Best Checkpoint** | Step 400 |
| **Max Training Steps** | 680 |
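
The WER figures in this table are word error rates: the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below illustrates that definition in plain Python. The function name `word_error_rate` and the lowercase, whitespace-split tokenization are illustrative assumptions; the text normalization used to produce the reported numbers is not described in this card and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: simple lowercase + whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word ("zit" -> "zat") out of five reference words.
example_wer = word_error_rate("de kat zit op tafel", "de kat zat op tafel")
print(f"{example_wer:.2%}")  # 20.00%
```

Corpus-level WER is usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance scores.
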

### Comparison with Synthetic-Augmented Models (Whisper-Small Dutch)

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---------------|-----------|----------|---------|---------------|----------------|
| **Common Voice Only** | **680** | **0.1491** | **8.73%** | **11.13%** | **30.71%** |
| High-Quality Filtered + CV | 890 | 0.1493 | 8.76% | 11.00% | 29.91% |
| Mid-High Quality Filtered + CV | 1,270 | 0.1484 | 8.73% | 10.86% | 30.04% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.1484 | 8.64% | 10.91% | 30.06% |

### Key Observations

- **Baseline performance**: 11.13% Test WER on Common Voice, 30.71% on MLS
- **Fastest training**: Only 680 max steps (smallest dataset)
- **Room for improvement**: Synthetic augmentation reduces Common Voice Test WER by up to 0.27% absolute (2.4% relative)
- **Cross-domain gap**: The 19.58% absolute difference between CV and MLS Test WER highlights domain mismatch
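
The absolute and relative figures in these observations follow directly from the comparison table. The short sketch below reproduces the arithmetic; the constant and function names are illustrative, and the WER values are copied from the table.

```python
# Test WER values (in percent) copied from the comparison table.
BASELINE_CV_WER = 11.13
BASELINE_MLS_WER = 30.71
AUGMENTED_CV_WER = {
    "High-Quality Filtered + CV": 11.00,
    "Mid-High Quality Filtered + CV": 10.86,
    "All Synthetic + CV (Unfiltered)": 10.91,
}


def wer_reduction(baseline: float, other: float) -> tuple[float, float]:
    """Return (absolute reduction in WER points, relative reduction in percent)."""
    absolute = baseline - other
    relative = 100.0 * absolute / baseline
    return absolute, relative


for name, wer in AUGMENTED_CV_WER.items():
    absolute, relative = wer_reduction(BASELINE_CV_WER, wer)
    print(f"{name}: {absolute:.2f} absolute, {relative:.1f}% relative")

# Largest reduction over the baseline, and the CV-to-MLS gap of the baseline.
best_absolute, best_relative = max(
    wer_reduction(BASELINE_CV_WER, wer) for wer in AUGMENTED_CV_WER.values()
)
cross_domain_gap = BASELINE_MLS_WER - BASELINE_CV_WER
print(f"best: {best_absolute:.2f} absolute, {best_relative:.1f}% relative")
print(f"cross-domain gap: {cross_domain_gap:.2f}")
```
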

## Training Data

### Dataset

| Source | Samples | Description |
|--------|---------|-------------|
| [Common Voice 17.0 Dutch](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 34,952 | Real speech from Mozilla's crowdsourced dataset |

Common Voice 17.0 Dutch contains crowdsourced voice recordings from volunteer contributors reading text prompts. The dataset provides diverse speaker demographics but is limited in acoustic conditions and speaking styles.

## Training Procedure

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 1e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
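
The step counts reported in this card can be cross-checked against these hyperparameters. The sketch below assumes one optimizer step per global batch of 256 and that the incomplete final batch of each epoch does not count as a step. Both assumptions are inferred rather than stated in this card, but under them five epochs over 34,952 samples give exactly the reported 680 max steps.

```python
TRAIN_SAMPLES = 34_952
GLOBAL_BATCH_SIZE = 256
EPOCHS = 5
WARMUP_STEPS = 200
BEST_STEP = 400

# Assumption: the partial last batch of each epoch is not counted as a step.
steps_per_epoch = TRAIN_SAMPLES // GLOBAL_BATCH_SIZE
max_steps = steps_per_epoch * EPOCHS

# Where the best checkpoint and the end of warmup fall in the schedule.
best_epoch = BEST_STEP / steps_per_epoch
warmup_fraction = WARMUP_STEPS / max_steps

print(f"steps per epoch: {steps_per_epoch}")            # 136
print(f"max steps: {max_steps}")                        # 680
print(f"best checkpoint at epoch: {best_epoch:.2f}")    # 2.94
print(f"warmup share of training: {warmup_fraction:.0%}")  # 29%
```

Read this way, the best checkpoint falls just before the end of the third epoch, and warmup covers a little under a third of the run.
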

### Training Infrastructure

- **GPU**: NVIDIA H200 (140GB VRAM)
- **Operating System**: Ubuntu 22.04
- **Framework**: Hugging Face Transformers

### Training Curve

```
Step 100: val_loss = 0.1754
Step 200: val_loss = 0.1563
Step 300: val_loss = 0.1514
Step 400: val_loss = 0.1492 ← Best checkpoint
Step 500: val_loss = 0.1516
Step 650: val_loss = 0.1533
```
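
Selecting the best checkpoint by `eval_loss` amounts to taking the step with the lowest validation loss. A minimal sketch over the values listed above (the card reports evaluation every 50 steps, so this is only the subset shown in the curve; the variable names are illustrative):

```python
# Validation losses copied from the training curve above.
val_loss_by_step = {
    100: 0.1754,
    200: 0.1563,
    300: 0.1514,
    400: 0.1492,
    500: 0.1516,
    650: 0.1533,
}

# The best checkpoint is the step with the minimum validation loss.
best_step = min(val_loss_by_step, key=val_loss_by_step.get)
print(best_step, val_loss_by_step[best_step])  # 400 0.1492
```

The loss rises again after step 400, which is why a checkpoint from the middle of training is kept instead of the final one.
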

## Usage

### Transcription Pipeline

```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-cv-only-nl",
    device="cuda"
)

result = transcriber("path/to/dutch_audio.wav")
print(result["text"])
```

### Direct Model Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the processor (feature extractor + tokenizer) and the fine-tuned model.
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-small-cv-only-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-small-cv-only-nl")
model.to("cuda")

# Load the audio resampled to 16 kHz, the sampling rate the model expects.
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

# Generate token IDs and decode them into text.
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Specifying Language

```python
# Force Dutch transcription instead of relying on language auto-detection.
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"
```

## When to Use This Model

This model is ideal for:
- **Baseline comparisons**: Evaluating the impact of synthetic data augmentation
- **Real-data-only requirements**: When synthetic data usage is not permitted
- **Minimal training**: Fastest training time among all configurations

For better performance, consider the synthetic-augmented variants:
- [whisper-small-high-mixed-nl](https://huggingface.co/yuriyvnv/whisper-small-high-mixed-nl): 0.13% absolute WER reduction on Common Voice, best MLS performance
- [whisper-small-mixed-cv-nl](https://huggingface.co/yuriyvnv/whisper-small-mixed-cv-nl): 0.27% absolute WER reduction on Common Voice, best CV performance

## Limitations

- **No synthetic augmentation**: Does not benefit from additional acoustic diversity
- **Domain specificity**: Trained only on Common Voice; limited generalization to other domains
- **Cross-domain performance**: Significant performance drop on MLS benchmark (30.71% vs 11.13%)
- **Dialect coverage**: Performance may vary across Dutch regional variants

## Citation

```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```

## References

- **Base Model**: [openai/whisper-small](https://huggingface.co/openai/whisper-small)
- **Training Data**: [mozilla-foundation/common_voice_17_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- **Whisper Paper**: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- **IEEE Access Paper**: [Enhancing ASR with Semantic Audio Filtering](https://ieeexplore.ieee.org/document/10720758)

## License

Apache 2.0