Whistle
Whistle is a multilingual and crosslingual ASR model pretrained with weak phonetic supervision using IPA transcriptions generated by LanguageNet G2P models. Unlike self-supervised or grapheme-based approaches, Whistle leverages phoneme-level representations to enable better data efficiency, crosslingual generalization, and reduced catastrophic forgetting. Trained and evaluated on the CommonVoice-based CV-Lang10 benchmark, Whistle demonstrates superior performance on both seen and unseen languages under limited-data conditions.
Whistle was proposed in the paper Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision by Saierdaer Yusuyin et al. from THU-SPMI. The original code repository can be found here.
Model details
Whistle is a Conformer-based encoder model trained with the CTC (Connectionist Temporal Classification) objective. It was trained on ~4k hours of labelled speech data sourced from the publicly available CommonVoice_v11 corpus.
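With CTC training, the encoder emits a phoneme (or blank) posterior per frame; decoding collapses consecutive repeats and drops the blanks. A minimal greedy-decoding sketch is shown below; the toy phoneme inventory and per-frame labels are illustrative only, not Whistle's actual IPA inventory or outputs.

```python
# Minimal CTC greedy decoding sketch: take the argmax label per frame,
# collapse consecutive repeats, then remove the blank token.
# The inventory and frame labels below are illustrative, not Whistle's.
BLANK = 0

def ctc_greedy_decode(frame_label_ids, blank=BLANK):
    decoded = []
    prev = None
    for label in frame_label_ids:
        # Keep a label only when it differs from the previous frame
        # (repeat collapse) and is not the blank symbol.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Toy inventory: {0: blank, 1: "k", 2: "ae", 3: "t"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
ids = ctc_greedy_decode(frames)
inventory = {1: "k", 2: "ae", 3: "t"}
print([inventory[i] for i in ids])  # -> ['k', 'ae', 't']
```

Note that a blank between two identical labels keeps them distinct (e.g. `[1, 0, 1]` decodes to two tokens), which is how CTC represents repeated phonemes.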
Whistle checkpoints come in three sizes: small (90 MB), medium (218 MB), and large (543 MB). Subword-based and wav2vec-based models of the small size were also trained for comparison. The multilingual ASR models are trained on CV-Lang10 data and then evaluated on the test set of each language without fine-tuning. All of the pre-trained checkpoints are available on the Hugging Face Hub. The checkpoints are summarised in the following table with links to the models on the Hub:
Evaluation
Results are reported in Phoneme Error Rate (PER%) and Word Error Rate (WER%).
Evaluation on Public CommonVoice_v11
PER (%)
WER (%) with 4-gram LM
For more results, please refer to the benchmark.
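Both metrics are edit-distance based: the Levenshtein distance between hypothesis and reference token sequences (phonemes for PER, words for WER), divided by the reference length. A minimal sketch, with illustrative example strings not taken from the paper's test sets:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming.
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def error_rate(ref_tokens, hyp_tokens):
    # WER when tokens are words, PER when tokens are phonemes.
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

# Illustrative example: one deleted word over a 6-word reference.
ref = "the cat sat on the mat".split()
hyp = "the cat sat on mat".split()
print(f"WER: {error_rate(ref, hyp):.1f}%")  # -> WER: 16.7%
```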
Training Data
All of our multilingual ASR models are trained on the 10 languages of cv-lang10, which have been processed as described in lang-process. For the English wav2vec-base model and the multilingual wav2vec-base model, however, only the audio is used for training. The language IDs and training hours of the ten languages are given in the following table.
| Language | Language ID | # of phonemes | Train hours | Dev hours | Test hours |
|---|---|---|---|---|---|
| English | en | 39 | 2227.3 | 27.2 | 27.0 |
| Spanish | es | 32 | 382.3 | 26.0 | 26.5 |
| French | fr | 33 | 823.4 | 25.0 | 25.4 |
| Italian | it | 30 | 271.5 | 24.7 | 26.0 |
| Kirghiz | ky | 32 | 32.7 | 2.1 | 2.2 |
| Dutch | nl | 39 | 70.2 | 13.8 | 13.9 |
| Russian | ru | 32 | 149.8 | 14.6 | 15.0 |
| Swedish | sv-SE | 33 | 29.8 | 5.5 | 6.2 |
| Turkish | tr | 41 | 61.5 | 10.1 | 11.4 |
| Tatar | tt | 31 | 20.8 | 3.0 | 5.7 |
BibTeX entry and citation info
@article{yusuyin2025whistle,
title={Whistle: Data-efficient multilingual and crosslingual speech recognition via weakly phonetic supervision},
author={Yusuyin, Saierdaer and Ma, Te and Huang, Hao and Zhao, Wenbo and Ou, Zhijian},
journal={IEEE Transactions on Audio, Speech and Language Processing},
year={2025},
publisher={IEEE}
}
Community
If you encounter problems in use, you can raise an issue directly on the GitHub page.