Model Details
Model Description
A 17.31M-parameter multilingual linear projector (version 2) trained for automatic speech recognition (ASR) using the SLAM-ASR speechLLM framework. Within this framework, only the linear projector was trained, alongside a frozen speech encoder (Whisper-large-v3-turbo) and a frozen LLM (EuroLLM-1.7B).
- Developed by: SpeechTek Unit at Fondazione Bruno Kessler
- Funded by: This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
- Model type: Linear projector in a speechLLM framework
- Supported Language(s): English, French, German, Italian, Spanish, Portuguese, Dutch, Polish, Hungarian, Czech, Romanian, Bulgarian, Slovak, Slovene, Serbian, Greek, Danish, Swedish, Finnish, Latvian, Lithuanian, Estonian, Welsh, Maltese, Breton, Irish, Galician, and Basque.
- License: CC-BY-4.0
Uses
This model is trained for Automatic Speech Recognition (ASR) and is version 2 of the mEUltilingual speechLLM projectors collection.
How to Get Started with the Model
This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding with the shell scripts provided in the SLAM-ASR codebase; refer to the instructions there for further details.
Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.
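Both base checkpoints are public on the Hugging Face Hub, so one convenient way to fetch everything is `huggingface_hub`. In the sketch below, the projector repo ID is a placeholder to be replaced with the actual ID of this repository; the two base-model IDs are the public ones.

```python
# Minimal download sketch using huggingface_hub.
from huggingface_hub import snapshot_download

# Placeholder ID: replace with the actual repository ID of this projector.
projector_dir = snapshot_download("YOUR-ORG/multilingual-linear-projector-v2")
encoder_dir = snapshot_download("openai/whisper-large-v3-turbo")  # frozen speech encoder
llm_dir = snapshot_download("utter-project/EuroLLM-1.7B")         # frozen LLM

print(projector_dir, encoder_dir, llm_dir)
```

The resulting local paths can then be plugged into the corresponding path variables of the SLAM-ASR shell scripts.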
Training Details
Training Data
The linear projector was trained on a multilingual dataset covering 28 European languages, built from widely used speech corpora: Common Voice 17.0, FLEURS, and VoxPopuli. As the distribution of data across languages is highly imbalanced, we applied a cap of 100K audio samples per language per dataset, discarding any additional samples beyond this threshold. This strategy reduces data skew while keeping training computationally feasible. To assess the generalizability and robustness of our models on out-of-domain speech, we used the official evaluation set of the INTERSPEECH 2025 MLC-SLM Challenge.
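The capping strategy can be pictured with the sketch below, which keeps at most 100K samples per (dataset, language) pair; the field names are illustrative, not the actual preprocessing schema.

```python
# Illustrative sketch of the per-language, per-dataset cap described above.
from collections import defaultdict

MAX_SAMPLES_PER_LANG_PER_DATASET = 100_000

def cap_samples(samples):
    """Keep at most 100K samples per (dataset, language) pair, in input order."""
    counts = defaultdict(int)
    kept = []
    for sample in samples:  # assumed schema: dict with "dataset" and "language" keys
        key = (sample["dataset"], sample["language"])
        if counts[key] < MAX_SAMPLES_PER_LANG_PER_DATASET:
            counts[key] += 1
            kept.append(sample)
    return kept
```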
Training Procedure
- The model was trained using the codebase provided by the official SLAM-ASR GitHub repository with `torchrun`.
- Only the linear projector was trained.
- The speech encoder (Whisper-large-v3-turbo) and the LLM (EuroLLM-1.7B) were kept frozen, without applying LoRA during training (see the sketch after this list).
- A single English prompt was used for training across all languages: "Transcribe speech to text."
- Training was conducted on a single NVIDIA L40S (Ada Lovelace) GPU.
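The freezing scheme can be illustrated with the PyTorch/Transformers sketch below. It is not the SLAM-ASR training loop itself, only a minimal picture of the setup: both base models keep `requires_grad=False`, so only the projector's parameters reach the optimizer.

```python
# Sketch of the freezing scheme: frozen encoder and LLM, trainable projector only.
import torch
from transformers import AutoModelForCausalLM, WhisperModel

encoder = WhisperModel.from_pretrained("openai/whisper-large-v3-turbo").encoder
llm = AutoModelForCausalLM.from_pretrained("utter-project/EuroLLM-1.7B")

# Freeze both base models; no LoRA adapters are attached.
for frozen in (encoder, llm):
    frozen.eval()
    for p in frozen.parameters():
        p.requires_grad = False

# Stand-in for the linear projector (full sketch after the hyperparameter table);
# only its parameters are passed to the optimizer.
projector = torch.nn.Linear(1280 * 5, 2048)
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
```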
Training Hyperparameters
Parameter | Value |
---|---|
llm_name | eurollm-1.7b |
llm_dim | 2048 |
context_length | 4096 |
encoder_name | whisper |
encoder_projector_ds_rate | 5 |
encoder_dim | 1280 |
encoder_projector | linear |
input_type | mel |
mel_size | 128 |
epochs | 3 |
freeze_encoder | true |
freeze_llm | true |
warmup_steps | 1000 |
total_steps | 100000 |
lr | 1e-4 |
validation_interval | 1000 |
batch_size_training | 4 |
val_size_training | 4 |
num_workers_dataloader | 2 |
optimizer | AdamW |
enable_fsdp | false |
enable_ddp | true |
use_fp16 | true |
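For reference, here is a minimal projector sketch consistent with the values above (encoder_dim=1280, encoder_projector_ds_rate=5, llm_dim=2048). It approximates the SLAM-ASR linear ("concat") projector rather than reproducing its exact code, but it matches the ~17.31M parameter count quoted in the description.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Stack every ds_rate encoder frames, then map them to the LLM dimension."""

    def __init__(self, encoder_dim=1280, llm_dim=2048, ds_rate=5, hidden_dim=2048):
        super().__init__()
        self.ds_rate = ds_rate
        self.linear1 = nn.Linear(encoder_dim * ds_rate, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, llm_dim)

    def forward(self, x):  # x: (batch, frames, encoder_dim)
        b, t, d = x.size()
        x = x[:, : t - t % self.ds_rate, :]        # drop trailing frames
        x = x.reshape(b, -1, d * self.ds_rate)     # concatenate ds_rate frames
        return self.linear2(self.relu(self.linear1(x)))

projector = LinearProjector()
print(sum(p.numel() for p in projector.parameters()))  # 17,305,600 ≈ 17.31M
```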
Evaluation
The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library.
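For example, with placeholder transcripts:

```python
# WER computation with the Hugging Face `evaluate` library (placeholder strings).
import evaluate

wer_metric = evaluate.load("wer")
wer = wer_metric.compute(
    predictions=["transcribe speech to text"],
    references=["transcribed speech to text"],
)
print(f"WER: {100 * wer:.2f}%")
```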
Results
WER (%) per language on Common Voice (CV), FLEURS (FL), and the MLC-SLM evaluation set (MLC):

Language | CV | FL | MLC |
---|---|---|---|
Spanish | 5.22 | 4.09 | 21.86 |
German | 7.11 | 7.79 | 32.77 |
Dutch | 6.83 | 8.65 | - |
Portuguese | 9.39 | 4.86 | 51.75 |
Galician | 12.70 | 9.98 | - |
English | 12.94 | 6.34 | 46.56 |
Polish | 14.19 | 8.68 | - |
Czech | 11.16 | 11.32 | - |
French | 11.24 | 7.83 | 42.05 |
Hungarian | 14.59 | 16.87 | - |
Italian | 6.01 | 3.32 | 36.13 |
Swedish | 15.99 | 10.94 | - |
Romanian | 17.39 | 9.65 | - |
Danish | 18.81 | 14.65 | - |
Basque | 19.96 | - | - |
Bulgarian | 24.26 | 15.20 | - |
Finnish | 22.61 | 15.29 | - |
Latvian | 27.12 | 17.23 | - |
Lithuanian | 28.27 | 24.30 | - |
Greek | 30.06 | 18.35 | - |
Slovak | 35.84 | 9.71 | - |
Slovenian | 34.72 | 19.41 | - |
Estonian | 37.19 | 19.83 | - |
Welsh | 50.40 | 39.96 | - |
Serbian | 56.49 | 27.60 | - |
Maltese | 58.84 | 44.89 | - |
Breton | 95.68 | - | - |
Irish | 82.23 | 88.06 | - |
Acknowledgements
