MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine
MedGPT-oss is an open-weight 20B-parameter vision–language model for biomedicine, built on GPT-oss-20B with a CLIP-ViT-L/14@336px visual encoder and a two-layer MLP projector. It is trained with a three-stage curriculum (alignment → long-context mid-training → instruction tuning) and is designed for on-premises, privacy-preserving clinical research.
📄 Paper: arXiv:2603.00842
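The two-layer MLP projector mentioned above maps CLIP-ViT patch features into the language model's embedding space. A minimal sketch of this component, assuming a 1024-d CLIP-ViT-L/14 feature width and a 2880-d LM hidden size (both dimensions are illustrative assumptions, not confirmed by this card):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP projector (sketch): maps vision patch features
    into the language model's embedding space."""

    def __init__(self, vision_dim: int = 1024, hidden_dim: int = 2880, lm_dim: int = 2880):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)

# At 336px with 14px patches, CLIP-ViT-L/14 yields 24 x 24 = 576 patch tokens.
tokens = torch.randn(1, 576, 1024)
projected = VisionProjector()(tokens)  # shape: (1, 576, 2880)
```

The projected tokens are then concatenated with text embeddings and fed to the language model as ordinary input positions.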
Example usage with Hugging Face Transformers:

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "UFNLP/MedGPT-oss"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("chest_xray.png").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the findings in this chest X-ray."},
]}]

# Render the chat template to a prompt string, then let the processor
# tokenize the text and preprocess the image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
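The three-stage curriculum (alignment → long-context mid-training → instruction tuning) described above is commonly implemented by staging which components are trainable. A hedged sketch of that pattern follows; the component names and the exact freeze/unfreeze schedule are assumptions for illustration, not a confirmed description of the paper's recipe:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(vision_encoder: nn.Module, projector: nn.Module,
                    language_model: nn.Module, stage: str) -> None:
    """Illustrative per-stage freezing schedule for a VLM curriculum."""
    if stage == "alignment":
        # Stage 1: train only the projector on paired image-text data.
        set_trainable(vision_encoder, False)
        set_trainable(projector, True)
        set_trainable(language_model, False)
    elif stage in ("mid_training", "instruction_tuning"):
        # Stages 2-3: projector and LM train jointly; the vision
        # encoder is often kept frozen throughout.
        set_trainable(vision_encoder, False)
        set_trainable(projector, True)
        set_trainable(language_model, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
```

Keeping the vision encoder frozen in early stages stabilizes alignment training and reduces memory; whether MedGPT-oss ever unfreezes it is a detail to check in the paper.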
@article{zhang2026medgptoss,
title = {MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine},
author = {Zhang, Kai and Yuan, Zhengqing and Peng, Cheng and Zhao, Songlin and
Lyu, Mengxian and Chen, Ziyi and Ye, Yanfang and Liu, Wei and
Zhang, Ying and Smith, Kaleb E. and He, Lifang and Sun, Lichao and Wu, Yonghui},
journal = {arXiv preprint arXiv:2603.00842},
year = {2026}
}
Contact: Lichao Sun (lis221@lehigh.edu) · Yonghui Wu (yonghui.wu@ufl.edu)