A large language model built on structured electronic health records showed consistent predictive performance across multiple clinical tasks and external health systems. The framework generates predictions for mortality, readmission and intensive care unit admission, and produces treatment recommendations. Development draws on data from 42,160 patients within a large health system and includes validation across three external institutions. The approach addresses persistent challenges in clinical decision-making, where fragmented data and complex workflows limit the adoption of predictive models. By integrating multimodal clinical information and aligning outputs with established diagnostic criteria, the system aims to deliver consistent and interpretable predictions within routine care processes.
A Generalist Model for Clinical Prediction
The framework introduces a clinical language model designed to operate as a diagnostic generalist across multiple prediction tasks. It integrates structured and unstructured data from electronic health records, including demographics, laboratory results, imaging reports and diagnostic codes, into a unified representation. Imaging encounters provide a consistent entry point, allowing clinical variables to be aligned around specific diagnostic events such as CT pulmonary angiography, cardiac CT angiography, cardiac magnetic resonance imaging and right upper quadrant ultrasound. This structure supports consistent cohort definition and reduces variability in data selection.
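To make the idea of an imaging-anchored unified representation concrete, the following is a minimal sketch of how one encounter might be flattened into text for a language model. The `ImagingEncounter` class, the `serialize` function, the 30-day lookback window and the patient values are all illustrative assumptions; the article does not describe the authors' actual schema or serialisation format.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ImagingEncounter:
    """Hypothetical container for one imaging-anchored encounter."""
    modality: str                               # e.g. "CT pulmonary angiography"
    study_date: date
    demographics: dict = field(default_factory=dict)
    labs: list = field(default_factory=list)    # (name, value, unit, date) tuples
    diagnosis_codes: list = field(default_factory=list)
    imaging_report: str = ""


def serialize(enc, lookback_days=30):
    """Render the encounter as plain text, keeping only labs drawn within
    `lookback_days` before (or on) the index study date (assumed window)."""
    lines = [f"Index study: {enc.modality} on {enc.study_date.isoformat()}"]
    demo = ", ".join(f"{k}={v}" for k, v in sorted(enc.demographics.items()))
    lines.append(f"Demographics: {demo or 'not recorded'}")
    for name, value, unit, drawn in enc.labs:
        age_days = (enc.study_date - drawn).days
        if 0 <= age_days <= lookback_days:
            lines.append(f"Lab: {name} {value} {unit} ({age_days} d before study)")
    codes = ", ".join(enc.diagnosis_codes)
    lines.append(f"Diagnostic codes: {codes or 'none'}")
    lines.append(f"Imaging report: {enc.imaging_report.strip() or 'not available'}")
    return "\n".join(lines)


# Synthetic example patient, not taken from the study.
example = ImagingEncounter(
    modality="CT pulmonary angiography",
    study_date=date(2024, 3, 10),
    demographics={"age": 67, "sex": "F"},
    labs=[("D-dimer", 1.8, "mg/L", date(2024, 3, 9)),
          ("Creatinine", 0.9, "mg/dL", date(2023, 11, 1))],  # outside the window
    diagnosis_codes=["I26.99"],
    imaging_report="Segmental filling defect in the right lower lobe.",
)
prompt_text = serialize(example)
print(prompt_text)
```

Anchoring every variable to the study date is what allows the same cohort rule to be applied to each imaging modality; in this sketch the stale creatinine value is dropped because it falls outside the assumed window.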
The model is built on a transformer-based architecture with seven billion parameters and undergoes pre-training followed by task-specific fine-tuning. The pipeline includes data collection, zero-shot evaluation, fine-tuning with structured supervision and deployment across multiple institutions. Clinical preference alignment combines established diagnostic criteria with hierarchical diagnostic coding systems to guide model outputs. This alignment supports reasoning patterns that follow recognised clinical pathways while maintaining flexibility across conditions. The system processes heterogeneous data without extensive manual feature engineering, addressing a key limitation of traditional predictive models that depend on structured numerical inputs and complex preprocessing workflows.
Performance Across Predictive Tasks
Evaluation covers four core prediction tasks: 90-day mortality, 30-day intensive care unit admission, 30-day hospital readmission and treatment recommendation. Performance is assessed under three configurations: a fine-tuned model, a structured zero-shot version and an unstructured zero-shot baseline. The fine-tuned model achieves a mean area under the receiver operating characteristic curve of 0.84, with consistent improvements over both zero-shot configurations. Gains reach 9.2% compared with structured zero-shot performance and 27.9% compared with unstructured inputs.
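For readers less familiar with the metric, the area under the receiver operating characteristic curve can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way on toy data. The helper names and all numbers are illustrative assumptions. The article does not say whether the reported percentage gains are absolute or relative, so both definitions are shown.

```python
def auroc(labels, scores):
    """Probability that a random positive case scores above a random negative
    case, counting ties as half (the Mann-Whitney formulation of AUROC)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def absolute_gain(new, baseline):
    """Difference in metric points, e.g. 0.75 vs 0.60 is a gain of 0.15."""
    return new - baseline


def relative_gain(new, baseline):
    """Difference as a fraction of the baseline, e.g. 0.75 vs 0.60 is 25%."""
    return (new - baseline) / baseline


# Toy data: 3 of the 4 positive-negative pairs are ranked correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = auroc(toy_labels, toy_scores)
print(toy_auc)
```

The pairwise loop is quadratic in cohort size, which is acceptable for illustration; a rank-based implementation would be preferable on tens of thousands of patients.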
Mortality prediction demonstrates stable discrimination across time horizons, including in-hospital and post-discharge outcomes. Temporal risk stratification across multiple time intervals reaches an accuracy of 0.73, indicating that risk can be stratified across short-term and longer-term outcomes. Intensive care unit admission prediction achieves an area under the curve of 0.86, while readmission prediction reaches 0.85. Treatment recommendation, evaluated within an ultrasound-based cohort, achieves an area under the curve of 0.79. Across all tasks, improvements are linked to structured data integration and alignment with clinical reasoning frameworks.
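Temporal risk stratification can be framed as a multi-class problem in which each patient is assigned to the time interval containing the outcome, with accuracy measured as the share of patients placed in the correct interval. The sketch below illustrates that framing. The interval boundaries (7, 30 and 90 days) and the function names are assumptions for illustration only, since the article does not list the intervals the authors used.

```python
import bisect

# Assumed interval boundaries in days; not taken from the study.
BIN_EDGES_DAYS = [7, 30, 90]
BIN_LABELS = ["<=7d", "8-30d", "31-90d", ">90d or alive"]


def risk_bin(days_to_death):
    """Map time-to-death in days (None if the patient survived follow-up)
    to an interval label. Each edge is the inclusive upper bound of its bin."""
    if days_to_death is None:
        return BIN_LABELS[-1]
    return BIN_LABELS[bisect.bisect_left(BIN_EDGES_DAYS, days_to_death)]


def accuracy(predicted, observed):
    """Fraction of patients whose predicted interval matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


# Toy cohort: observed outcomes converted to interval labels.
observed = [risk_bin(d) for d in (3, 30, 45, None)]
predicted = ["<=7d", "8-30d", ">90d or alive", ">90d or alive"]
print(accuracy(predicted, observed))
```

Accuracy over interval labels treats every misclassification equally, so a prediction that is off by one interval counts the same as one that is off by three.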
Internal validation across imaging modalities shows consistent performance gains over both structured and unstructured baseline models. Improvements range from 0.08 to 0.15 compared with structured zero-shot configurations and from 0.20 to 0.30 compared with unstructured inputs. These results indicate that task-specific fine-tuning and structured data alignment enhance predictive accuracy across pulmonary, cardiac and abdominal imaging contexts.
Robustness Across Institutions and Data Sources
External validation includes datasets from three additional health systems, covering diverse patient populations and imaging modalities. Performance remains stable across institutions, with only modest differences between internal and external evaluations. Mean differences in predictive performance remain low, indicating consistent generalisation across clinical environments. Across eight prediction tasks, external performance closely mirrors internal results, supporting deployment across heterogeneous healthcare settings.
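One simple way to summarise how closely external results track internal ones is the mean absolute difference in a metric across every task and site pair. The sketch below shows that calculation. The function name, task names, site names and scores are placeholders, not results from the study, and the article does not state which summary statistic the authors used.

```python
from statistics import mean


def mean_abs_gap(internal, external_sites):
    """Mean absolute difference between internal and external scores.

    internal: dict mapping task name to internal score.
    external_sites: dict mapping site name to a dict of task scores;
        a site may report only a subset of tasks.
    """
    gaps = [abs(site_scores[task] - internal[task])
            for site_scores in external_sites.values()
            for task in site_scores]
    if not gaps:
        raise ValueError("no external scores to compare")
    return mean(gaps)


# Placeholder values for illustration only.
internal = {"task_a": 0.80, "task_b": 0.70}
external = {
    "site_1": {"task_a": 0.78, "task_b": 0.71},
    "site_2": {"task_a": 0.82},
}
gap = mean_abs_gap(internal, external)
print(round(gap, 4))
```

Using the absolute difference prevents sites that score above the internal result from cancelling out sites that score below it.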
Comparisons with other large language models show that structured input and clinical alignment improve performance across architectures. Models incorporating preference-aligned retrieval demonstrate higher accuracy and reduced variability compared with unstructured configurations. Across multiple models, performance increases when structured clinical context is introduced, highlighting the importance of guided data selection and alignment with clinical knowledge.
Multimodal input analysis shows that predictive accuracy improves as additional data sources are incorporated. Starting from demographic information alone, performance increases progressively with the inclusion of laboratory data, diagnostic codes and imaging reports. Across all evaluated models, full multimodal integration produces the highest accuracy, demonstrating the value of combining structured and narrative clinical data.
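The incremental analysis described above can be organised as a cumulative ablation: evaluate on demographics alone, then add one data source at a time in a fixed order. The sketch below builds those input configurations and runs a caller-supplied scoring function over each. The function names and the `evaluate` callback are hypothetical, and the stand-in scorer only counts modalities so that the example runs without a model.

```python
# Order of inclusion as described in the article.
MODALITY_ORDER = [
    "demographics",
    "laboratory data",
    "diagnostic codes",
    "imaging reports",
]


def cumulative_configs(order):
    """Return the nested input sets: the first modality alone, then the
    first two, and so on up to the full multimodal input."""
    return [order[:i] for i in range(1, len(order) + 1)]


def run_ablation(order, evaluate):
    """Score each cumulative configuration with `evaluate`, a hypothetical
    callable that takes a list of modality names and returns a metric."""
    return {" + ".join(cfg): evaluate(cfg) for cfg in cumulative_configs(order)}


# Stand-in scorer for demonstration: counts modalities instead of running a model.
results = run_ablation(MODALITY_ORDER, evaluate=len)
for config, score in results.items():
    print(f"{config}: {score}")
```

Because the configurations are nested, any change between consecutive rows can be attributed to the single modality added at that step, although the attribution depends on the order chosen.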
Comparisons with conventional machine learning and deep learning models further illustrate performance differences across input conditions. Traditional models hold a limited advantage when only basic demographic information is available, but this advantage diminishes as additional clinical data are included. With full multimodal input, the language model demonstrates higher accuracy than logistic regression, random forest, support vector machine, multilayer perceptron and transformer-based baselines. External validation reflects similar trends, with performance gains increasing as clinical context expands.
The framework demonstrates that large language models can integrate heterogeneous clinical data and deliver consistent predictive performance across multiple tasks and institutions. Structured alignment with clinical criteria and multimodal data integration support improvements in accuracy, interpretability and generalisation. Performance remains stable across internal and external datasets, indicating suitability for deployment in diverse healthcare settings. The results highlight the role of clinically aligned language models as unified predictive systems capable of supporting decision-making across diagnostic and prognostic workflows without reliance on task-specific pipelines.
Source: npj Digital Medicine
References:
Wang Y, Dai Y, Wang R et al. (2026) Integrating large language models for enhanced predictive analytics in healthcare. npj Digit Med: In Press.