Model Details

Model Description

A 17.31M-parameter multilingual linear projector (version 2) trained for automatic speech recognition (ASR) using the SLAM-ASR speechLLM framework. Within this framework, only the linear projector was trained, alongside a frozen speech encoder (Whisper-large-v3-turbo) and a frozen LLM (EuroLLM-1.7B).

  • Developed by: SpeechTek Unit at Fondazione Bruno Kessler
  • Funded by: The European Union’s Horizon 2020 project ELOQUENCE (grant 101070558), which partially funded this work.
  • Model type: Linear projector in a speechLLM framework
  • Supported Language(s): English, French, German, Italian, Spanish, Portuguese, Dutch, Polish, Hungarian, Czech, Romanian, Bulgarian, Slovak, Slovene, Serbian, Greek, Danish, Swedish, Finnish, Latvian, Lithuanian, Estonian, Welsh, Maltese, Breton, Irish, Galician, and Basque.
  • License: CC-BY-4.0

Uses

This model is trained for automatic speech recognition (ASR) and is version 2 of the mEUltilingual speechLLM projectors collection.

How to Get Started with the Model

This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding with the shell scripts provided in the SLAM-ASR codebase; refer to the instructions there for details.

Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.
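
Both models are available on the Hugging Face Hub. As a minimal sketch, they can be fetched with huggingface_hub; the repository identifiers below are the public Hub IDs for these two models, and the local paths are placeholders to adapt to your setup:

```python
# Sketch: fetch the frozen speech encoder and LLM from the Hugging Face Hub.
# The local_dir paths are placeholders; adapt them to your setup.
from huggingface_hub import snapshot_download

snapshot_download("openai/whisper-large-v3-turbo", local_dir="models/whisper-large-v3-turbo")
snapshot_download("utter-project/EuroLLM-1.7B", local_dir="models/EuroLLM-1.7B")
```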

Training Details

Training Data

The linear projector was trained on a multilingual dataset covering 28 European languages, drawing on three widely used speech corpora: Common Voice 17.0, FLEURS, and VoxPopuli. Because the distribution of data across languages is highly imbalanced, we applied a cap of 100K audio samples per language per dataset, discarding any samples beyond this threshold. This strategy reduced data skew while keeping training computationally feasible. To assess the generalisability and robustness of our models on out-of-domain speech, we used the official evaluation set of the INTERSPEECH 2025 MLC-SLM Challenge.
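
The capping strategy is straightforward to reproduce. Below is a minimal sketch (not the actual preprocessing code) that keeps the first 100K samples per (dataset, language) pair; the `dataset` and `language` field names are hypothetical, chosen for illustration:

```python
from collections import defaultdict

MAX_SAMPLES = 100_000  # cap per language per dataset, as described above

def cap_per_language(samples):
    """Yield at most MAX_SAMPLES audio samples per (dataset, language) pair,
    discarding anything beyond the threshold. The 'dataset' and 'language'
    keys are hypothetical field names used for illustration."""
    counts = defaultdict(int)
    for sample in samples:
        key = (sample["dataset"], sample["language"])
        if counts[key] < MAX_SAMPLES:
            counts[key] += 1
            yield sample
```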

Training Procedure

  • The model was trained with torchrun, using the codebase provided in the official SLAM-ASR GitHub repository.
  • Only the linear projector was trained.
  • The speech encoder (Whisper-large-v3-turbo) and the LLM (EuroLLM-1.7B) were kept frozen, without applying LoRA during training.
  • A single English-language prompt was used for training: "Transcribe speech to text."
  • Training was conducted on a single NVIDIA L40S (Ada Lovelace) GPU.

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| llm_name | eurollm-1.7b |
| llm_dim | 2048 |
| context_length | 4096 |
| encoder_name | whisper |
| encoder_projector_ds_rate | 5 |
| encoder_dim | 1280 |
| encoder_projector | linear |
| input_type | mel |
| mel_size | 128 |
| epochs | 3 |
| freeze_encoder | true |
| freeze_llm | true |
| warmup_steps | 1000 |
| total_steps | 100000 |
| lr | 1e-4 |
| validation_interval | 1000 |
| batch_size_training | 4 |
| val_size_training | 4 |
| num_workers_dataloader | 2 |
| optimizer | AdamW |
| enable_fsdp | false |
| enable_ddp | true |
| use_fp16 | true |
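
For orientation, the sketch below shows how these settings determine the projector's shape, following the SLAM-ASR linear projector design: encoder_projector_ds_rate=5 consecutive 1280-dim Whisper frames are concatenated and mapped into the 2048-dim EuroLLM embedding space. The 2048-dim hidden layer is an assumption, chosen because it reproduces the stated 17.31M parameter count; consult the SLAM-ASR codebase for the authoritative module definition.

```python
import torch.nn as nn

class LinearProjector(nn.Module):
    """Sketch of the linear projector implied by the configuration above.
    The 2048-dim hidden layer is an assumption that matches the reported
    17.31M parameter count; see the SLAM-ASR codebase for the real module."""

    def __init__(self, encoder_dim=1280, llm_dim=2048, ds_rate=5):
        super().__init__()
        self.ds_rate = ds_rate
        self.linear1 = nn.Linear(encoder_dim * ds_rate, 2048)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(2048, llm_dim)

    def forward(self, x):  # x: (batch, frames, encoder_dim)
        b, t, d = x.size()
        t = t - t % self.ds_rate  # drop trailing frames not divisible by ds_rate
        x = x[:, :t].reshape(b, t // self.ds_rate, d * self.ds_rate)
        return self.linear2(self.relu(self.linear1(x)))

proj = LinearProjector()
print(sum(p.numel() for p in proj.parameters()))  # 17305600 ≈ 17.31M
```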

Evaluation

The model was evaluated using the Word Error Rate (WER) metric from the evaluate library.
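
For reference, WER with the evaluate library is computed as follows; the transcripts here are placeholders:

```python
import evaluate

wer = evaluate.load("wer")  # word error rate metric from the evaluate library
score = wer.compute(
    predictions=["transcribe speech to text"],     # placeholder hypothesis
    references=["transcribe the speech to text"],  # placeholder reference
)
print(score)  # 0.2: one deletion out of five reference words
```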

Results

WER (%) by language on the Common Voice (CV), FLEURS (FL), and MLC-SLM (MLC) test sets; a dash indicates that no result is reported for that test set.

| Language | CV | FL | MLC |
| --- | --- | --- | --- |
| Spanish | 5.22 | 4.09 | 21.86 |
| German | 7.11 | 7.79 | 32.77 |
| Dutch | 6.83 | 8.65 | - |
| Portuguese | 9.39 | 4.86 | 51.75 |
| Galician | 12.70 | 9.98 | - |
| English | 12.94 | 6.34 | 46.56 |
| Polish | 14.19 | 8.68 | - |
| Czech | 11.16 | 11.32 | - |
| French | 11.24 | 7.83 | 42.05 |
| Hungarian | 14.59 | 16.87 | - |
| Italian | 6.01 | 3.32 | 36.13 |
| Swedish | 15.99 | 10.94 | - |
| Romanian | 17.39 | 9.65 | - |
| Danish | 18.81 | 14.65 | - |
| Basque | 19.96 | - | - |
| Bulgarian | 24.26 | 15.20 | - |
| Finnish | 22.61 | 15.29 | - |
| Latvian | 27.12 | 17.23 | - |
| Lithuanian | 28.27 | 24.30 | - |
| Greek | 30.06 | 18.35 | - |
| Slovak | 35.84 | 9.71 | - |
| Slovenian | 34.72 | 19.41 | - |
| Estonian | 37.19 | 19.83 | - |
| Welsh | 50.40 | 39.96 | - |
| Serbian | 56.49 | 27.60 | - |
| Maltese | 58.84 | 44.89 | - |
| Breton | 95.68 | - | - |
| Irish | 82.23 | 88.06 | - |

Acknowledgements

This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).