Wav2Vec2-NL
A Dutch Wav2Vec2-base model, pre-trained on 960 hours of exclusively Dutch speech.
Pre-training data was extracted from a combination of:
- the Spoken Dutch Corpus (537 hours; incl. spontaneous conversations, interviews, read speech and news reports)
- the Dutch component of Multilingual LibriSpeech (211 hours; audiobook segments)
- the Dutch subset of the CommonVoice 16.1 corpus (212 hours; read-aloud speech)
More information, including the training manifest and configuration, is available in the Wav2Vec2-NL repository on Zenodo.
Analyses of Dutch phonetic and lexical features encoded in Wav2Vec2-NL hidden states are reported in the paper What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training (Interspeech 2025; see full citation below).
Note: This model does not have a tokenizer, as it was pre-trained on audio alone. To use it for speech recognition, create a tokenizer and fine-tune the model on labeled speech data (audio paired with transcriptions). Check out this blog for an explanation of fine-tuning Wav2Vec2 models on HuggingFace.
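As a rough illustration of the tokenizer step mentioned in the note, the sketch below builds a character-level vocabulary from transcriptions, which is one common starting point for CTC fine-tuning. The function name `build_char_vocab`, the example transcripts, and the special tokens `[UNK]`, `[PAD]` and `|` are assumptions chosen for illustration and are not part of this model's release.

```python
import json


def build_char_vocab(transcripts):
    """Map every character seen in the transcripts to an integer id.

    Hypothetical helper: spaces are replaced by "|" as a word delimiter,
    and "[UNK]" / "[PAD]" are appended as special tokens.
    """
    tokens = sorted(
        {"|" if ch == " " else ch for text in transcripts for ch in text.lower()}
    )
    vocab = {token: idx for idx, token in enumerate(tokens)}
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


# Made-up transcripts; in practice these come from your labeled dataset.
transcripts = ["goede morgen", "dank je wel"]
vocab = build_char_vocab(transcripts)

# Serialized form that could be saved to a vocab.json file.
vocab_json = json.dumps(vocab, ensure_ascii=False)
```

A vocabulary like this could then be saved to disk and passed to a CTC tokenizer class such as `Wav2Vec2CTCTokenizer` in `transformers`; the exact fine-tuning setup depends on your data.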
Usage
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('amsterdamNLP/Wav2Vec2-NL')
model = Wav2Vec2Model.from_pretrained('amsterdamNLP/Wav2Vec2-NL')
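Once the model is loaded, hidden states can be extracted for analysis. The following is a minimal sketch, not the authors' analysis pipeline. It assumes 16 kHz mono audio and the standard Wav2Vec2-base convolutional encoder configuration (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2); the function names `expected_num_frames` and `extract_hidden_states` are hypothetical.

```python
def expected_num_frames(num_samples):
    """Number of output frames for an unpadded waveform.

    Assumes the standard Wav2Vec2-base convolutional feature encoder.
    """
    kernels = (10, 3, 3, 3, 3, 2, 2)
    strides = (5, 2, 2, 2, 2, 2, 2)
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


def extract_hidden_states(waveform, sampling_rate=16_000,
                          model_id="amsterdamNLP/Wav2Vec2-NL"):
    """Return the tuple of hidden states for a 1-D float waveform.

    Imports are local so the helper above can be used without
    torch/transformers installed.
    """
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
    model = Wav2Vec2Model.from_pretrained(model_id)
    model.eval()  # disable dropout for deterministic features

    inputs = feature_extractor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # One tensor per layer, each shaped (batch, frames, hidden_size).
    return outputs.hidden_states
```

Under these assumptions, one second of audio (16,000 samples) yields `expected_num_frames(16000)` frames, roughly one every 20 ms. Calling `extract_hidden_states` on a NumPy array of samples downloads the model on first use.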
Citation
The Wav2Vec2-NL model was published as part of: de Heer Kloots, M., Mohebbi, H., Pouw, C., Shen, G., Zuidema, W., Bentum, M. (2025). What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training. Proc. INTERSPEECH 2025. https://doi.org/10.21437/Interspeech.2025-1526
BibTeX entry:
@inproceedings{deheerkloots25_interspeech,
title = {What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training},
author = {Marianne {de Heer Kloots} and Hosein Mohebbi and Charlotte Pouw and Gaofei Shen and Willem Zuidema and Martijn Bentum},
year = {2025},
booktitle = {Interspeech 2025},
doi = {10.21437/Interspeech.2025-1526},
}