What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training
Abstract
Self-supervised Wav2Vec2 models encode Dutch linguistic features more accurately when pre-trained exclusively on Dutch data, compared to similar amounts of English or multilingual data, as shown by clustering and classification probes, and demonstrated through improved Automatic Speech Recognition performance.
How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it is less clear to what extent pre-training on specific languages improves the encoding of language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in the internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features, compared with pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is readily detected by trained clustering or classification probes, and is partially observable with zero-shot metrics. Furthermore, the language-specific benefit in linguistic feature encoding aligns with downstream Automatic Speech Recognition performance.
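The classification probes mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual setup: the feature matrix here is synthetic stand-in data for frame-level Wav2Vec2 hidden states, and the label set is a hypothetical phone inventory; the paper's real models, layers, and annotations are not specified in this summary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for frame-level hidden states from one Wav2Vec2 layer:
# 2000 frames x 768 dims, with class-dependent means so the probe has signal.
n_frames, dim, n_phones = 2000, 768, 10
labels = rng.integers(0, n_phones, size=n_frames)
class_means = rng.normal(0.0, 1.0, size=(n_phones, dim))
feats = class_means[labels] + rng.normal(0.0, 1.0, size=(n_frames, dim))

X_tr, X_te, y_tr, y_te = train_test_split(
    feats, labels, test_size=0.25, random_state=0
)

# A linear classification probe: held-out accuracy above chance indicates
# that the representation linearly encodes the probed feature (here, a
# hypothetical phone identity label).
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.3f}")
```

In a probing comparison like the one described, the same probe would be trained on representations from each pre-trained model (Dutch-only, English-only, multilingual), and the difference in held-out accuracy is read as a difference in how well each model encodes the feature.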
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC (2025)
- Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks (2025)
- Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically (2025)
- Speechless: Speech Instruction Training Without Speech for Low Resource Languages (2025)
- Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information (2025)
- Differentiable K-means for Fully-optimized Discrete Token-based ASR (2025)
- GigaAM: Efficient Self-Supervised Learner for Speech Recognition (2025)
Models citing this paper: 1
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0