arxiv:2511.13703

Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Published on Nov 17
· Submitted by Lavender Jiang on Nov 21
Abstract

Lang1, a specialized language model pretrained on clinical data, outperforms generalist models in predicting hospital operational metrics through supervised finetuning and real-world evaluation.

AI-generated summary

Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling effects: joint finetuning on multiple tasks improved performance on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.

Community


Can smaller, specialized models beat massive generalists in healthcare operations?

Healthcare operations (bed management, discharge planning, risk prediction) are high-stakes and data-sensitive. In this work, we explore the tradeoffs between off-the-shelf generalists and models specialized on internal patient notes. We introduce Lang1 (models) and ReMedE (benchmark).

🚀 The Models: Lang1

We introduce a family of decoder LLMs (100M, 1B, 7B) pretrained on 80B tokens of EHR notes + 627B web tokens.

  • David vs. Goliath: On hospital operation tasks, Lang1 (1B) outperforms DeepSeek R1 671B and LoRA-finetuned DeepSeek Distilled Llama 70B.
  • Zero-Shot Transfer: Instruction finetuning allows Lang1 to transfer zero-shot to related tasks and different hospitals better than generalist models of similar scale.

๐Ÿฅ The Benchmark: ReMedE

Existing benchmarks often rely on proxies that fail to capture real-world clinical constraints. We introduce ReMedE to fix this:

  • Scale: 5 clinical tasks across a multi-hospital system.
  • Depth: Data spanning 10 years (87k-421k patients per task).
  • Realism: Uses time-based splits to mimic actual deployment conditions (no future peeking).
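The time-based split above can be illustrated with a small sketch. This is not ReMedE's actual pipeline; the records, field layout, and cutoff date are invented for illustration. The point is that the model trains only on encounters that precede the cutoff and is evaluated on later ones, so it never "peeks" at the future.

```python
from datetime import date

# Toy encounter records: (admission_date, outcome_label).
# Dates and labels are invented for illustration.
records = [
    (date(2014, 3, 1), 0),
    (date(2016, 7, 12), 1),
    (date(2019, 1, 5), 0),
    (date(2021, 9, 30), 1),
    (date(2023, 2, 14), 0),
]

def time_based_split(records, cutoff):
    """Train on encounters admitted before the cutoff, test on those
    admitted on or after it, mimicking deployment: the model only ever
    sees the past relative to its evaluation data."""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test

train, test = time_based_split(records, date(2021, 1, 1))
print(len(train), len(test))  # → 3 2
```

Contrast this with a random split, which would leak future documentation patterns (e.g. new note templates or coding conventions) into training and overstate real-world performance.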

💡 Key Engineering Insights

  • Pretraining ≠ Performance: Next-token prediction on clinical notes builds comprehension, but Supervised Finetuning (SFT) is required for high performance on operational tasks.
  • Efficiency: Lower zero-shot perplexity from in-domain pretraining enables more data-efficient SFT.
  • Robustness: Larger specialized models handle temporal shifts (changes in data distributions over time) much better than generalists.
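To make the perplexity claim concrete, here is a minimal sketch of the metric itself: perplexity is the exponential of the average negative log-likelihood per token, so a model that assigns higher probabilities to in-domain text scores lower. The per-token probabilities below are invented; they only illustrate why an in-domain model would start SFT from a better footing.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-likelihood per token.
    Lower values mean the model finds the text less 'surprising'."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities two models might assign to the same
# tokens of a clinical note (values are made up for illustration):
in_domain = [0.5, 0.6, 0.4, 0.7]   # pretrained on EHR text
generalist = [0.2, 0.3, 0.1, 0.25]  # pretrained on web text only

print(perplexity(in_domain))   # ≈ 1.86
print(perplexity(generalist))  # ≈ 5.08
```

The intuition for data efficiency: a model already close to the target distribution needs smaller gradient updates during SFT, so fewer labeled examples suffice to reach a given AUROC.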

💭 Why This Matters

  • Feasibility & Cost: Training Lang1-1B cost roughly $180k (estimated on AWS), a figure comparable to routine hospital IT upgrades. This suggests health systems can "build institutional assets" rather than "rent intelligence" via APIs, gaining better data privacy and long-term control.

  • Operations > Diagnostics: With physicians spending ~74% of their time on documentation and logistics, focusing AI on operational tasks rather than pure diagnostics offers immediate, measurable ROI for healthcare delivery.

  • The Future of Clinical AI: Our findings challenge the assumption that internet-scale generalists are the solution for every domain. For high-stakes environments, smaller, domain-specific systems offer a more accurate, affordable, and robust path forward.

