Hospitals increasingly rely on machine learning to identify ward patients at risk of deterioration, yet concerns remain about whether such tools behave fairly once embedded in routine care. An evaluation of CHARTwatch, a real-time early warning system deployed on the general internal medicine wards of a large academic hospital, examined performance across sociodemographic groups and downstream effects on care. Using a historical control period and an implementation window from late 2020 to mid-2022, the assessment compared model predictions, changes in clinical processes and key outcomes. Subgroups were defined by age, sex, homelessness and neighbourhood measures of material resources and racialised or newcomer composition. A protocolised response pathway accompanied high-risk alerts, aiming to standardise reassessment while allowing clinical discretion.

 

Design, Cohort and Model Characteristics

The evaluation compared a pre-deployment control period with the phase during which CHARTwatch operated in routine care. High-risk status was retrospectively computed for the control period to enable like-for-like comparisons. Admissions for COVID-19 or influenza and those with pre-admission palliative status or comfort measures were excluded to align with model development and prior validation.

 


 

CHARTwatch generated predictions of in-hospital death or transfer to intensive care at regular intervals, using a time-aware modelling approach that produced risk scores fed into a logistic regression classifier. Before deployment, overall discrimination and calibration metrics indicated performance acceptable for clinical decision support. Subgroup definitions were prespecified and drawn from routinely captured data, including a validated algorithm to identify homelessness and indices characterising neighbourhood context. Sensitivity was broadly consistent across measured subgroups. Specificity varied by age, with lower values among the oldest patients than among younger groups, while no meaningful differences in specificity were observed by sex, homelessness or neighbourhood indices. Because outcome events were few, intersectional analyses were not performed. To support fair comparisons, baseline covariates were balanced between periods using propensity score overlap weighting.
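To make these two analytic steps concrete, the sketch below shows how per-subgroup sensitivity and specificity and propensity score overlap weights can be computed. This is a minimal illustration in Python, not the study's code: the table layout and column names (alert, outcome, a subgroup column, baseline covariates) are assumptions, and scikit-learn stands in for whatever software the investigators used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def subgroup_sens_spec(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Sensitivity and specificity of the high-risk flag within each subgroup.

    Assumes illustrative binary columns: 'alert' (model flagged high risk)
    and 'outcome' (non-palliative death or ICU transfer).
    """
    rows = {}
    for group, sub in df.groupby(group_col):
        tp = ((sub["alert"] == 1) & (sub["outcome"] == 1)).sum()
        fn = ((sub["alert"] == 0) & (sub["outcome"] == 1)).sum()
        tn = ((sub["alert"] == 0) & (sub["outcome"] == 0)).sum()
        fp = ((sub["alert"] == 1) & (sub["outcome"] == 0)).sum()
        rows[group] = {"sensitivity": tp / (tp + fn),
                       "specificity": tn / (tn + fp)}
    return pd.DataFrame(rows).T

def overlap_weights(covariates: pd.DataFrame, in_deployment: np.ndarray) -> np.ndarray:
    """Overlap weights for the pre-post comparison.

    A propensity model estimates e(x) = P(deployment period | covariates);
    deployment admissions receive weight 1 - e(x) and control admissions
    e(x), emphasising admissions whose profiles occur in both periods.
    """
    e = (LogisticRegression(max_iter=1000)
         .fit(covariates, in_deployment)
         .predict_proba(covariates)[:, 1])
    return np.where(in_deployment == 1, 1.0 - e, e)
```

Overlap weighting down-weights admissions whose covariate profile is rare in the other period, which is one reason it suits a pre-post design in which case mix can drift over time.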

 

Care Processes Following High-Risk Alerts

Deployment was paired with a standardised response pathway intended to promote timely reassessment and care planning when alerts signalled high risk. After deployment, documentation and monitoring intensified in ways aligned with this pathway: vital signs were recorded more frequently across subgroups, and antibiotic use increased. Glucocorticoid prescribing rose in several groups, including younger and older adults, males, and residents of areas with relatively greater material resources or lower racialised or newcomer composition. Imaging patterns changed only modestly, with a reduction in MRI use and no consistent subgroup-level differences for X-ray, CT, ultrasound or intravenous fluids.

 

One equity-relevant signal concerned patients experiencing homelessness. In this subgroup, new code status orders rose markedly after deployment, from 2.7% to 27.5%, whereas no significant change was observed among patients not experiencing homelessness. The increase aligned with the pathway’s emphasis on structured reassessment and goals-of-care discussions for high-risk patients, suggesting that a protocolised response linked to alerts may help address previously missed processes of care for a group with lower baseline documentation.

 

Outcomes and Interpretation Across Subgroups

During the deployment period, the weighted rate of non-palliative in-hospital death declined relative to control, from 2.1% to 1.6%. Within subgroups, differences in non-palliative mortality were not statistically significant after adjustment for multiple comparisons, and there were no significant subgroup changes in overall mortality or transfers to intensive care. Taken together, the results suggest that predictions were broadly equitable on sensitivity, that specificity varied with age, and that downstream process measures became more consistent without translating into clear subgroup differences in hard outcomes during the observation window.
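As a brief illustration of how such weighted rates are formed, the following self-contained Python fragment computes an overlap-weighted event rate per period; the dataframe, column names and weights are hypothetical stand-ins, not study data.

```python
import numpy as np
import pandas as pd

def weighted_rate(df: pd.DataFrame, period: int) -> float:
    """Overlap-weighted rate of non-palliative death within one period."""
    sub = df[df["period"] == period]
    return float(np.average(sub["np_death"], weights=sub["w"]))

# Toy admissions: 'period' (1 = deployment, 0 = control), 'np_death'
# (non-palliative in-hospital death) and 'w' (overlap weights from a
# propensity model, as sketched earlier).
df = pd.DataFrame({
    "period":   [0, 0, 0, 1, 1, 1],
    "np_death": [1, 0, 0, 0, 1, 0],
    "w":        [0.4, 0.6, 0.5, 0.5, 0.3, 0.6],
})

print(weighted_rate(df, 0), weighted_rate(df, 1))
```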

 

Interpretation requires caution. Some sociodemographic variables of interest, such as race, gender identity and language, were unavailable or unreliable at the point of capture, so neighbourhood-level proxies for race and income were used instead. Data on how clinicians engaged with alerts were not available, limiting insight into whether variation in response, rather than in model predictions, drove observed differences. Intersectional analyses were not feasible because of small event counts. Finally, because this was a non-randomised pre–post evaluation, estimates are associative despite efforts to balance measured covariates between periods.

 

Within a large general internal medicine service, a real-time early warning model showed generally consistent sensitivity across measured sociodemographic groups and lower specificity among the oldest patients. Deployment was associated with more standardised monitoring and selected treatment changes after high-risk alerts, alongside a pronounced increase in code status documentation among patients experiencing homelessness. Although subgroup differences in mortality and intensive care transfer were not detected, the combination of equitable sensitivity, targeted process changes and markedly improved goals-of-care documentation for an underserved group offers a pragmatic template for equity-focused assessment of machine learning clinical decision support. Structured evaluations spanning predictions, processes and outcomes can help identify where standardisation reduces gaps and where further work is needed to ensure fair, reliable benefit across patient groups.

 

Source: JAMIA Open

Image Credit: iStock


References:

Colacci M, Pou-Prom C, Siddiqi A et al. (2025) Evaluating sociodemographic bias in a deployed machine-learned patient deterioration model. JAMIA Open, 8(6):ooaf158.


