# MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation
**EMNLP 2025**
Khai Le-Duc*, Tuyen Tran*, Bach Phan Tat, Nguyen Kim Hai Bui, Quan Dang, Hung-Phong Tran, Thanh-Thuy Nguyen, Ly Nguyen, Tuan-Minh Phan, Thi Thu Phuong Tran, Chris Ngo, Nguyen X. Khanh**, Thanh Nguyen-Tang**
*Equal contribution | **Equal supervision
If you find this work useful, please consider starring the repo and citing our paper!

## Abstract
Multilingual speech translation (ST) in the medical domain enhances patient care by enabling effective communication across language barriers, alleviating workforce shortages, and improving diagnosis and treatment, especially during global health emergencies.
In this work, we introduce MultiMed-ST, the first large-scale multilingual medical speech translation dataset, spanning all translation directions across five languages: Vietnamese, English, German, French, and Chinese (both Traditional and Simplified).
With 290,000 samples, MultiMed-ST represents:
- the largest medical MT dataset to date
- the largest many-to-many multilingual ST dataset across all domains
We also conduct, to the best of our knowledge, the most comprehensive ST analysis in the field to date, covering:
- Empirical baselines
- Bilingual vs. multilingual study
- End-to-end vs. cascaded models
- Task-specific vs. multi-task seq2seq approaches
- Code-switching analysis
- Quantitative & qualitative error analysis
All code, data, and models are publicly available: GitHub Repository

## Repository Overview
This repository provides scripts for:
- Automatic Speech Recognition (ASR)
- Machine Translation (MT)
- Speech Translation (ST), covering both cascaded and end-to-end seq2seq models

It includes:
- Model preparation & fine-tuning
- Training & inference scripts
- Evaluation & benchmarking utilities
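A cascaded ST system chains the ASR and MT stages, while an end-to-end model maps speech to target-language text directly. The composition can be sketched as follows; the stub callables are placeholders, not the repository's actual models:

```python
from typing import Any, Callable

def cascaded_st(audio: Any, asr: Callable[[Any], str], mt: Callable[[str], str]) -> str:
    """Cascaded speech translation: transcribe the audio, then translate the transcript."""
    transcript = asr(audio)
    return mt(transcript)

if __name__ == "__main__":
    # Stand-ins for real models: any ASR/MT callables with these signatures work,
    # e.g. a fine-tuned Whisper checkpoint for ASR and an MT model for translation.
    fake_asr = lambda audio: "the patient has a high fever"
    fake_mt = lambda text: "le patient a une forte fievre"
    print(cascaded_st(b"<wav-bytes>", fake_asr, fake_mt))
```

Because the two stages are independent, any of the ASR and MT checkpoints listed below can be mixed and matched; errors in the transcript, however, propagate into the translation.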
## Dataset & Models
- Dataset: Hugging Face Dataset
- Fine-tuned Models: Hugging Face Models

You can explore and download all fine-tuned models for MultiMed-ST directly from our Hugging Face repository:
### Whisper ASR Fine-tuned Models
| Language | Model Link |
|---|---|
| Chinese | whisper-small-chinese |
| English | whisper-small-english |
| French | whisper-small-french |
| German | whisper-small-german |
| Multilingual | whisper-small-multilingual |
| Vietnamese | whisper-small-vietnamese |
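The checkpoints above follow a `whisper-small-<language>` naming pattern. A minimal inference sketch using the `transformers` ASR pipeline; the `leduckhai/` hub namespace and the audio filename are assumptions, so verify the exact model ids on the Hugging Face hub:

```python
HUB_USER = "leduckhai"  # assumed namespace; verify on the Hugging Face hub
ASR_LANGS = ("chinese", "english", "french", "german", "multilingual", "vietnamese")

def whisper_checkpoint(language: str) -> str:
    """Build the fine-tuned Whisper checkpoint id for a language from the table above."""
    lang = language.lower()
    if lang not in ASR_LANGS:
        raise ValueError(f"no fine-tuned checkpoint for {language!r}")
    return f"{HUB_USER}/whisper-small-{lang}"

if __name__ == "__main__":
    # pip install transformers
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=whisper_checkpoint("vietnamese"))
    print(asr("consultation.wav")["text"])  # path to a local audio file
```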
### LLaMA-based MT Fine-tuned Models
| Source → Target | Model Link |
|---|---|
| Chinese → English | llama_Chinese_English |
| Chinese → French | llama_Chinese_French |
| Chinese → German | llama_Chinese_German |
| Chinese → Vietnamese | llama_Chinese_Vietnamese |
| English → Chinese | llama_English_Chinese |
| English → French | llama_English_French |
| English → German | llama_English_German |
| English → Vietnamese | llama_English_Vietnamese |
| French → Chinese | llama_French_Chinese |
| French → English | llama_French_English |
| French → German | llama_French_German |
| French → Vietnamese | llama_French_Vietnamese |
| German → Chinese | llama_German_Chinese |
| German → English | llama_German_English |
| German → French | llama_German_French |
| German → Vietnamese | llama_German_Vietnamese |
| Vietnamese → Chinese | llama_Vietnamese_Chinese |
| Vietnamese → English | llama_Vietnamese_English |
| Vietnamese → French | llama_Vietnamese_French |
| Vietnamese → German | llama_Vietnamese_German |
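The LLaMA checkpoints follow a `llama_<Source>_<Target>` pattern. A loading sketch with heavy caveats: the hub namespace and, in particular, the prompt format are assumptions here, since the exact template the models were fine-tuned with should be taken from the repository's MT inference scripts:

```python
LLAMA_MT_LANGS = ("Chinese", "English", "French", "German", "Vietnamese")

def llama_mt_checkpoint(src: str, tgt: str) -> str:
    """Build the fine-tuned LLaMA MT checkpoint name for a translation direction."""
    if src not in LLAMA_MT_LANGS or tgt not in LLAMA_MT_LANGS or src == tgt:
        raise ValueError(f"unsupported direction: {src} -> {tgt}")
    return f"llama_{src}_{tgt}"

if __name__ == "__main__":
    # pip install transformers
    from transformers import pipeline

    # The "leduckhai/" namespace and the prompt below are hypothetical;
    # use the template from the repo's MT scripts for faithful results.
    mt = pipeline("text-generation", model="leduckhai/" + llama_mt_checkpoint("English", "French"))
    print(mt("Translate into French: The patient reports chest pain.")[0]["generated_text"])
```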
### m2m100_418M MT Fine-tuned Models
| Source → Target | Model Link |
|---|---|
| de → en | m2m100_418M-finetuned-de-to-en |
| de → fr | m2m100_418M-finetuned-de-to-fr |
| de → vi | m2m100_418M-finetuned-de-to-vi |
| de → zh | m2m100_418M-finetuned-de-to-zh |
| en → de | m2m100_418M-finetuned-en-to-de |
| en → fr | m2m100_418M-finetuned-en-to-fr |
| en → vi | m2m100_418M-finetuned-en-to-vi |
| en → zh | m2m100_418M-finetuned-en-to-zh |
| fr → de | m2m100_418M-finetuned-fr-to-de |
| fr → en | m2m100_418M-finetuned-fr-to-en |
| fr → vi | m2m100_418M-finetuned-fr-to-vi |
| fr → zh | m2m100_418M-finetuned-fr-to-zh |
| vi → de | m2m100_418M-finetuned-vi-to-de |
| vi → en | m2m100_418M-finetuned-vi-to-en |
| vi → fr | m2m100_418M-finetuned-vi-to-fr |
| vi → zh | m2m100_418M-finetuned-vi-to-zh |
| zh → de | m2m100_418M-finetuned-zh-to-de |
| zh → en | m2m100_418M-finetuned-zh-to-en |
| zh → fr | m2m100_418M-finetuned-zh-to-fr |
| zh → vi | m2m100_418M-finetuned-zh-to-vi |
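A minimal sketch of translating with one of the fine-tuned m2m100_418M checkpoints. The helper reproduces the naming pattern from the table above; the `leduckhai/` hub namespace and the example sentence are assumptions. M2M100 selects the target language by forcing its language token as the first decoded token:

```python
M2M_LANGS = ("de", "en", "fr", "vi", "zh")

def m2m_checkpoint(src: str, tgt: str) -> str:
    """Build the fine-tuned m2m100_418M checkpoint name for a translation direction."""
    if src not in M2M_LANGS or tgt not in M2M_LANGS or src == tgt:
        raise ValueError(f"unsupported direction: {src} -> {tgt}")
    return f"m2m100_418M-finetuned-{src}-to-{tgt}"

if __name__ == "__main__":
    # pip install transformers sentencepiece
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    name = "leduckhai/" + m2m_checkpoint("de", "en")  # namespace is assumed
    tokenizer = M2M100Tokenizer.from_pretrained(name)
    model = M2M100ForConditionalGeneration.from_pretrained(name)

    tokenizer.src_lang = "de"
    batch = tokenizer("Der Patient klagt über Brustschmerzen.", return_tensors="pt")
    # forced_bos_token_id steers M2M100 to decode into the target language.
    out = model.generate(**batch, forced_bos_token_id=tokenizer.get_lang_id("en"))
    print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```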
## Core Developers
- **Khai Le-Duc** (University of Toronto, Canada)
  Email: duckhai.le@mail.utoronto.ca, GitHub: https://github.com/leduckhai
- **Tuyen Tran** (Hanoi University of Science and Technology, Vietnam)
  Email: tuyencbt@gmail.com
- **Nguyen Kim Hai Bui** (Eötvös Loránd University, Hungary)
  Email: htlulem185@gmail.com
## Citation
If you use our dataset or models, please cite our paper (arXiv:2504.03546):

```bibtex
@inproceedings{le2025multimedst,
  title={MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation},
  author={Le-Duc, Khai and Tran, Tuyen and Tat, Bach Phan and Bui, Nguyen Kim Hai and Anh, Quan Dang and Tran, Hung-Phong and Nguyen, Thanh Thuy and Nguyen, Ly and Phan, Tuan Minh and Tran, Thi Thu Phuong and others},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={11838--11963},
  year={2025}
}
```
Model tree for leduckhai/MultiMed-ST: base model facebook/m2m100_418M.