Sepsis remains a major cause of in-hospital mortality worldwide and places a substantial burden on intensive care units and healthcare systems. Despite regular updates to clinical guidance and advances in treatment, outcomes for many patients remain poor, and early identification of those at highest risk continues to be challenging. Conventional risk scores and traditional machine learning models have contributed to prognostic assessment but can struggle with the complexity of high-dimensional intensive care data and often provide limited interpretability. Recent progress in large language models has opened new opportunities for handling complex clinical information, yet many applications rely only on simple prompting and underuse formal clinical guidance. The study summarised here evaluates a large language model explicitly fine-tuned with structured Surviving Sepsis Campaign guidance to improve sepsis mortality prediction and support decision-making in intensive care units.
From ICU Data to Guideline-Aware Knowledge Graph
Model development draws on the MIMIC-IV database, which contains electronic health records from more than 500,000 hospital admissions, including detailed intensive care data between 2008 and 2019. From this resource, 32,970 cases meeting Sepsis-3 criteria were identified. After excluding patients younger than 18 years, intensive care stays shorter than 24 hours or longer than 100 days, and repeated admissions, 24,237 unique adult sepsis cases remained. These were divided by stratified random sampling in an 8:1:1 ratio into training, validation and test cohorts of 19,389, 2,424 and 2,424 patients respectively. The primary endpoint was all-cause in-hospital mortality after the first 24 hours of intensive care, and standard severity scores such as SOFA, SAPS-II and APS-III were recalculated for benchmarking.
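An 8:1:1 stratified split like the one described can be sketched with scikit-learn. This is a minimal illustration only: the labels here are synthetic (drawn to match the reported ~14.7% mortality rate), and the study's actual sampling code is not public.

```python
# Sketch of an 8:1:1 stratified split (assumed implementation, not the
# study's own pipeline). Labels are synthetic for illustration.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 24237  # cohort size reported in the study
# Hypothetical binary mortality labels at roughly the reported 14.7% rate.
y = (rng.random(n) < 0.147).astype(int)
idx = np.arange(n)

# First split off the 80% training cohort, stratified on the outcome...
train_idx, rest_idx, y_train, y_rest = train_test_split(
    idx, y, test_size=0.2, stratify=y, random_state=42)
# ...then split the remaining 20% evenly into validation and test cohorts.
val_idx, test_idx = train_test_split(
    rest_idx, test_size=0.5, stratify=y_rest, random_state=42)

print(len(train_idx), len(val_idx), len(test_idx))  # 19389 2424 2424
```

Stratifying on the outcome keeps the mortality rate approximately equal across the three cohorts, which matters for a class-imbalanced endpoint like in-hospital death.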
For each patient, demographic characteristics, vital signs, laboratory test results and key interventions during the first 24 hours were extracted and processed. Implausible measurements were removed, extreme outliers were winsorised, missing values were addressed by multiple imputation and categorical variables were encoded. Continuous variables were standardised before established scores were recalculated, ensuring a consistent feature set for traditional models and the large language model.
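Two of the preprocessing steps above, winsorising extreme outliers and standardising continuous variables, can be sketched in a few lines. The percentile bounds and the example values are assumptions for illustration, not details taken from the study.

```python
# Minimal sketch of winsorisation and standardisation (assumed percentile
# bounds; toy values, not study data).
import numpy as np

def winsorise(x, lower_pct=1.0, upper_pct=99.0):
    """Clip values to the given percentile bounds to tame extreme outliers."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

def standardise(x):
    """Zero-mean, unit-variance scaling of a continuous variable."""
    return (x - x.mean()) / x.std()

lactate = np.array([0.8, 1.2, 2.5, 4.0, 18.0])  # mmol/L, toy values
clipped = winsorise(lactate)     # the 18.0 outlier is pulled toward the 99th percentile
z = standardise(clipped)         # standardised feature for model input
```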
The 2021 adult Surviving Sepsis Campaign guideline served as the sole structured knowledge source. Guideline text was segmented into individual recommendations and processed by an instruction-tuned large language model to identify core entities such as indicators, thresholds, actions, time frames and outcomes, together with relations linking thresholds to recommended responses. Two intensive care specialists reviewed and consolidated these outputs into a curated sepsis knowledge base, which was then converted into a machine-readable knowledge graph for downstream use.
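A machine-readable knowledge graph of the kind described can be represented as (entity, relation, entity) triples linking indicators to thresholds and thresholds to recommended actions. The schema and lookup function below are an illustrative assumption; the recommendation content only paraphrases examples mentioned in this article.

```python
# Illustrative triple-store sketch of a guideline knowledge graph
# (assumed schema; content paraphrased from the article's examples).
triples = [
    ("lactate", "has_threshold", "> 2 mmol/L"),
    ("> 2 mmol/L", "recommends_action", "guide resuscitation, remeasure lactate"),
    ("mean_arterial_pressure", "has_threshold", "< 65 mmHg"),
    ("< 65 mmHg", "recommends_action", "initiate vasopressors"),
    ("initiate vasopressors", "has_time_frame", "early in resuscitation"),
]

def actions_for(indicator):
    """Walk indicator -> threshold -> recommended action through the graph."""
    thresholds = [o for s, r, o in triples
                  if s == indicator and r == "has_threshold"]
    return [o for t in thresholds
            for s, r, o in triples
            if s == t and r == "recommends_action"]

print(actions_for("mean_arterial_pressure"))  # ['initiate vasopressors']
```

Keeping the graph as explicit triples is what later makes the model's outputs traceable back to a specific recommendation.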
Fine-Tuning and Performance of the Guideline-Enhanced LLM
Two related fine-tuning datasets were constructed. One contained only the processed clinical variables without additional context. The other appended structured interpretations derived from the guideline knowledge graph to each patient’s worst values in the first 24 hours. For example, lactate values above specified thresholds were labelled as high risk, mean arterial pressure below 65 mmHg was linked to vasopressor initiation and values within guideline ranges were explicitly noted. These annotations were incorporated into prompts for a pre-trained Qwen2.5-72B large language model, which was fine-tuned with supervised learning and Low-Rank Adaptation to build a parameter-efficient, guideline-aware mortality predictor.
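The annotation step, tagging each patient's worst 24-hour values against guideline thresholds before they enter the prompt, might look like the sketch below. The thresholds mirror the two examples given in the text; the function and variable names are hypothetical.

```python
# Hedged sketch of guideline-based prompt annotation. Thresholds follow the
# article's examples (lactate, MAP < 65 mmHg); names and wording are
# illustrative assumptions.
THRESHOLDS = {
    "lactate": (lambda v: v > 2.0,
                "high risk: above guideline lactate threshold"),
    "map": (lambda v: v < 65.0,
            "below 65 mmHg: guideline links to vasopressor initiation"),
}

def annotate(worst_values):
    """Turn worst-in-24h values into guideline-grounded prompt annotations."""
    notes = []
    for name, value in worst_values.items():
        if name not in THRESHOLDS:
            continue
        check, message = THRESHOLDS[name]
        if check(value):
            notes.append(f"{name}={value}: {message}")
        else:
            notes.append(f"{name}={value}: within guideline range")
    return notes

# Annotations are appended to the patient's feature prompt before fine-tuning.
prompt_context = "\n".join(annotate({"lactate": 4.1, "map": 58}))
```

Because each annotation is derived from a specific guideline rule, the same mapping can later be surfaced as a rationale alongside the model's prediction.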
Within the final cohort, in-hospital mortality was 14.7%, with 3,568 deaths among 24,237 patients. Non-survivors were older, had higher SOFA and SAPS-II scores, showed more deranged physiology and laboratory values, and required more vasopressors, mechanical ventilation and renal replacement therapy within 24 hours. This heterogeneity framed the challenge for all models.
Traditional machine learning comparators included support vector machine, Naive Bayes, k-nearest neighbour, logistic regression, decision tree, random forest and gradient boosting decision tree. Deep learning baselines comprised a multilayer perceptron, long short-term memory network, convolutional neural network and Transformer. A large language model without fine-tuning, relying on prompt-only inference, was also assessed. Hyperparameters for baselines were tuned using cross-validated search, and all models were evaluated on an independent 10% test cohort.
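The cross-validated hyperparameter search used for the baselines can be sketched with scikit-learn's grid search. The grid values and the synthetic data below are assumptions for illustration; the study's actual search space is not reported here.

```python
# Sketch of cross-validated tuning for one baseline (gradient boosting).
# Grid values and synthetic data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the processed 24-hour feature matrix and mortality labels.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```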
The guideline-enhanced large language model delivered the strongest overall performance, achieving an accuracy of 0.819, F1-score of 0.815, sensitivity of 0.815, specificity of 0.822 and an area under the receiver operating characteristic curve of 0.852. Among traditional models, gradient boosting decision trees and random forests performed best, with accuracies of 0.774 and 0.770 and AUCs of 0.850 and 0.831. Deep learning models showed moderate performance, with an LSTM accuracy of 0.762 and AUC of 0.841.
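The metrics reported above, accuracy, F1-score, sensitivity, specificity and AUC, are all standard derivations from a model's predicted probabilities and the true labels, computed as follows on toy data.

```python
# How the reported metrics are computed from predictions (toy data).
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             roc_auc_score, confusion_matrix)

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.2, 0.9, 0.6, 0.3])
y_pred = (y_prob >= 0.5).astype(int)  # threshold the predicted risk

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "sensitivity": tp / (tp + fn),   # recall on non-survivors
    "specificity": tn / (tn + fp),   # recall on survivors
    "auc": roc_auc_score(y_true, y_prob),  # threshold-free ranking quality
}
```

Note that accuracy, F1, sensitivity and specificity all depend on the chosen probability threshold, whereas AUC does not, which is why studies typically report both kinds of metric.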
Added Value of Guideline Integration and Remaining Limitations
Ablation experiments highlighted the contribution of supervised fine-tuning and explicit guideline integration. A prompt-only large language model reached an accuracy of 0.709, F1-score of 0.678 and AUC of 0.706. Adding supervised fine-tuning without any guideline information increased accuracy to 0.786, F1-score to 0.778 and AUC to 0.801. The combination of fine-tuning and guideline-based annotations produced the best results, with accuracy of 0.819, F1-score of 0.815 and AUC of 0.852. Testing a separate large language model with the same prompt templates yielded intermediate performance, supporting the importance of both adaptation to local data and structured clinical knowledge.
The guideline-enhanced model also generated mortality risk predictions linked to evidence-based sepsis pathways. Because annotations were derived from the knowledge graph, outputs could be accompanied by guideline-based rationales, reducing reliance on external explainability tools and contrasting with earlier machine learning models that depended heavily on post hoc feature importance analysis.
Several limitations were noted that affect wider implementation. Development and validation used data from a single tertiary urban hospital intensive care unit with a predominantly Caucasian population and a median age of around 65 years, which may constrain generalisability to other populations and settings. Fine-tuning a 72-billion-parameter large language model required substantial computational resources, including multiple high-end graphics processing units and supporting infrastructure, which may be difficult to reproduce in many institutions. Dependence on a single set of adult sepsis guidelines raised concerns about applicability to paediatric, obstetric or non-Western practice environments and the need to keep the knowledge graph aligned with evolving evidence. In addition, evaluation was based on retrospective data and did not address integration with real-time monitoring, streaming data or the potential influence on clinical workflows and alert fatigue.
Integrating adult Surviving Sepsis Campaign guidance into large language model fine-tuning produced a sepsis mortality predictor that outperformed traditional clinical scores, machine learning baselines and deep learning comparators across several metrics while offering guideline-aligned explanations. The findings indicate that explicit domain knowledge can enhance both predictive performance and interpretability in critical care artificial intelligence tools. Further multicentre validation and work on technical integration with intensive care data streams will be important to assess how such models support routine sepsis risk stratification and timely intervention planning in everyday practice.
Source: Health Informatics Journal
Image Credit: iStock