arxiv:2510.15710

UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

Published on Oct 17
· Submitted by Junzhi Ning on Oct 22
Authors:
Wei Li et al.

Abstract

AI-generated summary: A unified multimodal medical model integrates image understanding and generation, enhancing performance across various medical vision-language tasks.

Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs, including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems fragment this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To address these gaps, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. At the observation level, we construct UniMed-5M, a dataset of over 5.6M samples that reformats diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning, which systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first unified medical multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks while matching specialized models in generation quality across eight medical imaging modalities. Crucially, the unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.
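To make the three-level recipe concrete, the sketch below shows one plausible way a progressive curriculum could be scheduled: earlier stages train on narrow observation-level alignment tasks, and later stages widen the task mix toward joint understanding and generation. This is a minimal illustration; the stage names, task mixes, step counts, and the `train_stage` helper are all hypothetical assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class CurriculumStage:
    name: str
    tasks: list[str]   # task types mixed into this stage (illustrative)
    steps: int         # optimization steps allotted to the stage

# Hypothetical three-stage schedule mirroring Observation -> Knowledge -> Analysis.
STAGES = [
    CurriculumStage("observation", ["image_text_alignment"], steps=10_000),
    CurriculumStage("knowledge", ["vqa", "report_generation"], steps=20_000),
    CurriculumStage("analysis",
                    ["vqa", "report_generation", "image_generation", "segmentation"],
                    steps=30_000),
]

def run_curriculum(model, data_by_task, train_stage):
    """Run each stage in order, widening the task mix as training progresses.

    `model`, `data_by_task`, and `train_stage` are placeholders for the
    trainee model, per-task data sources, and a single-stage training loop.
    """
    for stage in STAGES:
        batches = {task: data_by_task[task] for task in stage.tasks}
        train_stage(model, batches, steps=stage.steps)
```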

Community


[Figures posted by the authors: teaser, framework overview, results, top-line performance, text-to-image samples, VAE demo, visual question answering, report generation, counterfactual generation, and qualitative examples across eight modalities (CXR, CT, MRI, ultrasound, endoscopy, OCT, retinal, histology).]

