Sepsis remains a leading cause of mortality in intensive care units (ICUs), where rapid physiological deterioration demands timely and well-informed clinical decisions. Conventional scoring systems such as the Sequential Organ Failure Assessment (SOFA) and the Acute Physiology and Chronic Health Evaluation II (APACHE II) provide structured assessments of organ dysfunction and mortality risk, yet they may miss subtle early changes within complex ICU data. Machine learning and deep learning models have demonstrated strong predictive performance for sepsis onset, ICU readmission and in-hospital mortality. However, their “black-box” nature has raised concerns regarding transparency, clinician trust and safe integration into high-stakes care environments. A systematic review of studies published between 2020 and 2025 examines how explainable artificial intelligence methods combine strong predictive accuracy with interpretability, with the aim of supporting more transparent and clinically actionable decision support in adult intensive care settings.

 

Scope and Methodological Approach

The review followed a structured protocol aligned with PRISMA 2020 guidance and focused on adult ICU populations. Studies published between January 2020 and January 2025 were identified through searches of ResearchGate, IEEE Xplore, Scopus, Web of Science and Google Scholar. Predefined keywords addressed artificial intelligence in ICUs, sepsis prediction, ICU readmission models, clinical decision support systems and explainable or interpretable machine learning.

 


 

Inclusion criteria required the use of interpretable or explainable AI techniques, clearly defined clinical outcomes such as sepsis onset or in-hospital mortality, and the use of ICU or electronic health record datasets including MIMIC-III, MIMIC-IV, Emory University Hospital, UCSD and Ruijin Hospital. Studies were excluded if they relied on black-box models without interpretability components, focused on paediatric populations or were conducted in non-ICU settings.

 

The initial search identified 100 records. After removal of duplicates and screening of titles and abstracts, 43 full-text studies met the eligibility criteria and were included in the qualitative synthesis. Data extraction captured dataset characteristics, model type, clinical task, interpretability method and key predictive features. Representative studies applied random forests, XGBoost, deep neural networks and ensemble learning approaches combined with post hoc explanation techniques.
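
To make the recurring pattern concrete, the minimal sketch below shows how such a pipeline is typically assembled: a tree-ensemble classifier fitted to a tabular feature matrix, with the fitted model kept available for post hoc explanation. The sketch is illustrative only and is not drawn from any of the included studies; the feature names, synthetic data and choice of XGBoost are assumptions for demonstration.

```python
# Illustrative sketch only: a tree-ensemble sepsis classifier of the kind
# described in the reviewed studies. The feature names and synthetic data
# are hypothetical stand-ins for real ICU/EHR variables.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
features = ["resp_rate", "gcs", "bun", "urine_output", "map",
            "lactate", "creatinine", "heart_rate", "temp", "platelets", "wbc"]
X = pd.DataFrame(rng.normal(size=(5000, len(features))), columns=features)

# Synthetic outcome loosely driven by a few physiological variables.
risk = 0.9 * X["lactate"] + 0.6 * X["resp_rate"] - 0.5 * X["map"]
y = (risk + rng.normal(size=len(X)) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Fit a gradient-boosted tree ensemble; the fitted model is later passed
# to a post hoc explainer such as SHAP.
model = xgb.XGBClassifier(n_estimators=300, max_depth=4,
                          learning_rate=0.05, eval_metric="logloss")
model.fit(X_train, y_train)

auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Held-out AUROC: {auroc:.3f}")
```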

 

Predictive Performance and Key Clinical Features

Across the included studies, explainable AI models consistently achieved strong predictive performance for sepsis onset, ICU readmission and in-hospital mortality. Many reported area under the receiver operating characteristic curve (AUROC) values exceeding 0.84, with some early sepsis prediction models reaching 0.89. Internal validation approaches such as k-fold cross-validation and train–test splits tended to yield higher performance estimates, while external or temporal validation produced slightly lower figures that better reflect generalisability to unseen cohorts.
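
The difference between these validation strategies can be shown in a short sketch. The example below is hypothetical and uses synthetic data rather than any cohort from the review: it compares a five-fold cross-validated AUROC estimate with a single held-out split. The random forest and scikit-learn utilities are illustrative choices, and neither strategy substitutes for external validation on a separate hospital's data.

```python
# Illustrative sketch only: comparing an internal k-fold estimate of AUROC
# with a single held-out split, the two internal validation strategies most
# often reported in the reviewed studies. Data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.85, 0.15], random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0)

# Five-fold cross-validation: every patient is used for both training and
# testing across folds, giving a low-variance but purely internal estimate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auroc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"5-fold AUROC: {cv_auroc.mean():.3f} +/- {cv_auroc.std():.3f}")

# Single train-test split: closer to facing unseen patients, but still drawn
# from the same cohort, so it is not a true external validation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)
clf.fit(X_tr, y_tr)
print(f"Held-out AUROC: {roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]):.3f}")
```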

 

Datasets such as MIMIC-III and MIMIC-IV were frequently used to train and validate models, alongside hospital-specific records from multi-centre cohorts. Cohort sizes in representative studies ranged from several thousand to more than 14,000 patients. Validation strategies included temporal splits, five-fold cross-validation and independent external datasets.

 

Explainability techniques were widely integrated into these predictive pipelines. SHapley Additive exPlanations (SHAP) was the most frequently applied method, complemented by Local Interpretable Model-agnostic Explanations (LIME), Gradient-weighted Class Activation Mapping (Grad-CAM), feature importance measures and sensitivity analysis. These approaches enabled identification of physiologically meaningful predictors associated with sepsis and mortality risk.
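
In practice, SHAP is applied to an already fitted model to produce both cohort-level and per-patient attributions. The sketch below is a hypothetical illustration of that pattern, with synthetic data, assumed feature names and an XGBoost model standing in for whichever classifier a given study used; it is not code from the reviewed studies.

```python
# Illustrative sketch only: post hoc SHAP explanations for a fitted tree model,
# the most common interpretability pattern in the reviewed studies.
# Feature names and data are synthetic stand-ins.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

rng = np.random.default_rng(7)
features = ["resp_rate", "gcs", "bun", "urine_output", "map", "lactate"]
X = pd.DataFrame(rng.normal(size=(3000, len(features))), columns=features)
y = (0.8 * X["lactate"] + 0.5 * X["resp_rate"]
     + rng.normal(size=len(X)) > 1.0).astype(int)

model = xgb.XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
model.fit(X, y)

# TreeExplainer computes exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: mean absolute SHAP value per feature approximates overall importance.
global_importance = pd.Series(np.abs(shap_values).mean(axis=0),
                              index=features).sort_values(ascending=False)
print(global_importance)

# Local view: each feature's contribution to a single patient's prediction
# (in the model's log-odds output).
print(pd.Series(shap_values[0], index=features))
```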

 

Commonly highlighted features included respiratory rate, Glasgow Coma Scale, blood urea nitrogen, urine output and mean arterial pressure. Additional variables such as lactate, creatinine, heart rate, body temperature, platelet count and white blood cell count were repeatedly identified across models. The recurrence of these indicators across different datasets and modelling approaches suggests alignment between algorithmic outputs and established clinical risk factors.

 

Interpretability, Bias and Implementation Challenges

Beyond performance metrics, the review assessed methodological quality using the PROBAST tool across domains of participants, predictors, outcomes and analysis. Many studies demonstrated low risk of bias in participant selection and outcome definition, with clearly defined sepsis or mortality endpoints. However, variability emerged in reporting of predictor measurement and analytical strategies. Some studies lacked external validation or detailed handling of missing data, limiting assessment of generalisability.

 

Most included research relied on retrospective datasets. Prospective implementation in real-time ICU environments remains limited. Although interpretability methods provide transparency, there is no universally standardised metric for evaluating the quality or stability of explanations. This variability complicates comparison across studies and may affect clinical confidence in model outputs.

 

The review also notes that explainability does not automatically guarantee correctness. Explanations may reflect artefacts of overfitted models or data noise rather than clinically meaningful signals. Careful interpretation and integration with clinical expertise are therefore essential. Explainable AI systems are positioned as complementary tools that support, rather than replace, professional judgement.

 

Transferability across institutions represents another important consideration. Studies evaluating models across multiple hospitals reported that interpretable systems maintained more consistent performance in new settings compared with opaque models. Nonetheless, broader multi-centre prospective validation is required to confirm reliability across diverse patient populations and healthcare systems.

 

Evidence indicates that explainable artificial intelligence enhances transparency in sepsis and ICU mortality prediction while maintaining strong predictive performance. Techniques such as SHAP, LIME, Grad-CAM and sensitivity analysis enable identification of clinically relevant physiological indicators including respiratory rate, Glasgow Coma Scale and renal function markers. Despite encouraging results, most models remain retrospectively validated and methodological variability persists in assessing interpretability. Further work focused on external validation, prospective evaluation and clinician-centred integration into ICU workflows is required to ensure that explainable AI systems can be safely and effectively embedded within routine critical care practice.

 

Source: BMC Medical Informatics and Decision Making

Image Credit: iStock


References:

Athukorala VS & Ilmini W (2026) Explainable AI for critical care: a systematic review of interpretable models for sepsis and ICU mortality prediction. BMC Med Inform Decis Mak: In Press.



