arxiv:2511.13703

Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Published on Nov 17
· Submitted by Lavender Jiang on Nov 21
Abstract

Lang1, a specialized language model pretrained on clinical data, outperforms generalist models in predicting hospital operational metrics through supervised finetuning and real-world evaluation.

AI-generated summary

Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling effects: joint finetuning on multiple tasks improved performance on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.

Community


Can smaller, specialized models beat massive generalists in healthcare operations?

Healthcare operations (bed management, discharge planning, risk prediction) are high-stakes and data-sensitive. In this work, we explore the tradeoffs between off-the-shelf generalists and models specialized on internal patient notes. We introduce Lang1 (models) and ReMedE (benchmark).

🚀 The Models: Lang1

We introduce a family of decoder LLMs (100M, 1B, 7B) pretrained on 80B tokens of EHR notes + 627B web tokens.

  • David vs. Goliath: On hospital operation tasks, Lang1 (1B) outperforms DeepSeek R1 671B and LoRA-finetuned DeepSeek Distilled Llama 70B.
  • Zero-Shot Transfer: Instruction finetuning allows Lang1 to transfer zero-shot to related tasks and different hospitals better than generalist models of similar scale.

๐Ÿฅ The Benchmark: ReMedE

Existing benchmarks often rely on proxies that fail to capture real-world clinical constraints. We introduce ReMedE to fix this:

  • Scale: 5 clinical tasks across a multi-hospital system.
  • Depth: Data spanning 10 years (87k-421k patients per task).
  • Realism: Uses time-based splits to mimic actual deployment conditions (no future peeking).
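The time-based split above can be illustrated with a small sketch. This is not ReMedE's actual pipeline; the records, field layout, and cutoff date are invented for illustration. The point is that the model trains only on encounters that precede the cutoff and is evaluated on later ones, so it never "peeks" at the future.

```python
from datetime import date

# Toy encounter records: (admission_date, outcome_label).
# Dates and labels are invented for illustration.
records = [
    (date(2014, 3, 1), 0),
    (date(2016, 7, 12), 1),
    (date(2019, 1, 5), 0),
    (date(2021, 9, 30), 1),
    (date(2023, 2, 14), 0),
]

def time_based_split(records, cutoff):
    """Train on encounters admitted before the cutoff, test on those
    admitted on or after it, mimicking deployment: the model only ever
    sees the past relative to its evaluation data."""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test

train, test = time_based_split(records, date(2021, 1, 1))
print(len(train), len(test))  # → 3 2
```

Contrast this with a random split, which would leak future documentation patterns (e.g. new note templates or coding conventions) into training and overstate real-world performance.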

💡 Key Engineering Insights

  • Pretraining ≠ Performance: Next-token prediction on clinical notes builds comprehension, but Supervised Finetuning (SFT) is required for high performance on operational tasks.
  • Efficiency: Lower zero-shot perplexity from in-domain pretraining enables more data-efficient SFT.
  • Robustness: Larger specialized models handle temporal shifts (changes in data distributions over time) much better than generalists.
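To make the perplexity claim concrete, here is a minimal sketch of the metric itself: perplexity is the exponential of the average negative log-likelihood per token, so a model that assigns higher probabilities to in-domain text scores lower. The per-token probabilities below are invented; they only illustrate why an in-domain model would start SFT from a better footing.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-likelihood per token.
    Lower values mean the model finds the text less 'surprising'."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities two models might assign to the same
# tokens of a clinical note (values are made up for illustration):
in_domain = [0.5, 0.6, 0.4, 0.7]   # pretrained on EHR text
generalist = [0.2, 0.3, 0.1, 0.25]  # pretrained on web text only

print(perplexity(in_domain))   # ≈ 1.86
print(perplexity(generalist))  # ≈ 5.08
```

The intuition for data efficiency: a model already close to the target distribution needs smaller gradient updates during SFT, so fewer labeled examples suffice to reach a given AUROC.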

💭 Why This Matters

  • Feasibility & Cost: Training Lang1-1B cost roughly $180k (estimated on AWS), a figure comparable to routine hospital IT upgrades. This suggests health systems can "build institutional assets" rather than "rent intelligence" via APIs, gaining better data privacy and long-term control.

  • Operations > Diagnostics: With physicians spending ~74% of their time on documentation and logistics, focusing AI on operational tasks rather than pure diagnostics offers immediate, measurable ROI for healthcare delivery.

  • The Future of Clinical AI: Our findings challenge the assumption that internet-scale generalists are the solution for every domain. For high-stakes environments, smaller, domain-specific systems offer a more accurate, affordable, and robust path forward.

