Clinical artificial intelligence is increasingly assessed through measures of accuracy, yet those measures do not fully answer whether a model can be trusted in routine care. The Safety-Aware Receiver Operating Characteristic, or SA-ROC, addresses that gap by focusing on operational safety rather than discrimination alone. It uses positive predictive value and negative predictive value thresholds to define when an AI output is reliable enough for autonomous rule-in or rule-out decisions. Predictions that do not meet either threshold are placed in a Gray Zone, where human review remains necessary. This structure turns model evaluation into a workflow question as well as a performance question. It also introduces Gray Zone Area as a way to quantify the burden created by uncertainty. By linking safety requirements to automation capacity, the framework provides a more practical basis for comparing models, setting decision thresholds and planning clinical use.

 

Turning Model Scores into Clinical Decision Zones

SA-ROC defines three operational zones. The Rule-in Safe Zone contains predictions that meet the required positive predictive value. The Rule-out Safe Zone contains predictions that meet the required negative predictive value. Predictions between those zones fall into the Gray Zone, where neither safety condition is satisfied and human review remains essential.

 

This approach replaces a single threshold with a structured workflow. Cases in the rule-out zone can be deprioritised, while cases in the rule-in zone can be escalated. Cases in the Gray Zone continue to require clinician assessment. The framework therefore maps AI output to autonomous de-prioritisation, essential human review and autonomous escalation.
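
The mapping from score to workflow action can be illustrated with a minimal Python sketch. It assumes the two cut-offs have already been chosen; the names `assign_zone`, `rule_out_max` and `rule_in_min` are hypothetical and do not come from the published framework.

```python
from enum import Enum


class Zone(Enum):
    """Operational zones, each paired with the workflow action described above."""
    RULE_OUT = "autonomous de-prioritisation"
    GRAY = "essential human review"
    RULE_IN = "autonomous escalation"


def assign_zone(score: float, rule_out_max: float, rule_in_min: float) -> Zone:
    """Map one model score to a zone.

    rule_out_max and rule_in_min are hypothetical cut-offs, assumed to have
    been derived elsewhere from the NPV and PPV requirements.
    """
    if rule_out_max >= rule_in_min:
        raise ValueError("rule_out_max must be below rule_in_min")
    if score <= rule_out_max:
        return Zone.RULE_OUT
    if score >= rule_in_min:
        return Zone.RULE_IN
    return Zone.GRAY


# Illustrative cut-offs only, not values from the study.
for s in (0.05, 0.50, 0.95):
    zone = assign_zone(s, rule_out_max=0.2, rule_in_min=0.8)
    print(s, zone.name, "->", zone.value)
```

Keeping the action text on the enum means a downstream worklist only needs the zone to know how to route a case.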

 

The framework can be configured in two ways. The first sets safety targets directly, allowing the user to choose explicit predictive value requirements for rule-in and rule-out decisions. The second uses utility maximisation, combining the benefit of correct decisions with the cost of errors and deferral. Either route makes it possible to tailor thresholds to different clinical priorities.
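
The direct-target method can be sketched as a search over a labelled validation set. This is one plausible reading rather than the authors' published procedure: it assumes each zone is judged by the aggregate PPV or NPV of all cases it contains, and the function name `safe_zone_thresholds` is hypothetical. Utility maximisation is not covered here.

```python
import numpy as np


def safe_zone_thresholds(scores, labels, ppv_target, npv_target):
    """Find cut-offs that give the widest zones meeting each target.

    Assumption: a zone is acceptable when the PPV (rule-in) or NPV (rule-out)
    computed over every validation case inside it meets the target.
    Returns (rule_out_max, rule_in_min); either is None if no cut-off qualifies.
    With lax targets the two zones can overlap, which the caller must resolve.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)  # 1 = disease present, 0 = absent
    rule_out_max = None
    rule_in_min = None
    for t in np.unique(scores):  # candidate cut-offs in ascending order
        # PPV of all cases scoring at or above t; keep the lowest qualifying t.
        ppv = labels[scores >= t].mean()
        if rule_in_min is None and ppv >= ppv_target:
            rule_in_min = float(t)
        # NPV of all cases scoring at or below t; keep the highest qualifying t.
        npv = 1.0 - labels[scores <= t].mean()
        if npv >= npv_target:
            rule_out_max = float(t)
    return rule_out_max, rule_in_min


# Toy validation data, invented for illustration.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1]
print(safe_zone_thresholds(scores, labels, ppv_target=1.0, npv_target=1.0))
```

On this toy set, scores from 0.4 to 0.7 meet neither target and would stay in the Gray Zone.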

 

Gray Zone Area adds an operational measure to this structure. A smaller Gray Zone means a larger share of cases can be handled safely by automation. A larger Gray Zone means uncertainty remains high and more cases must stay under manual review. The Safety Profile Curve extends this by showing how Gray Zone size changes as safety requirements become stricter. In that way, the framework makes the trade-off between safety and automation visible rather than implicit.
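
How Gray Zone size responds to stricter safety requirements can be sketched in Python. The exact definition of Gray Zone Area is not given here, so the sketch uses a simplified stand-in: the share of validation cases left in the Gray Zone when the same safety level is applied to both PPV and NPV. The function names are hypothetical, and scores are assumed to be distinct (ties are not handled).

```python
import numpy as np


def gray_zone_fraction(scores, labels, safety_level):
    """Share of cases in the Gray Zone at one safety level (simplified proxy)."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=int)[np.argsort(scores)]  # labels, low to high score
    n = len(y)
    k = np.arange(1, n + 1)
    npv = np.cumsum(1 - y) / k      # NPV of the k lowest-scoring cases
    ppv = np.cumsum(y[::-1]) / k    # PPV of the k highest-scoring cases
    ok_out = np.nonzero(npv >= safety_level)[0]
    ok_in = np.nonzero(ppv >= safety_level)[0]
    n_out = int(ok_out[-1]) + 1 if ok_out.size else 0  # widest rule-out zone
    n_in = int(ok_in[-1]) + 1 if ok_in.size else 0     # widest rule-in zone
    # Clamp at zero: with lax levels the two zones may cover every case.
    return max(n - n_out - n_in, 0) / n


def safety_profile(scores, labels, levels):
    """Gray Zone share at each safety level, as (level, share) pairs."""
    return [(lv, gray_zone_fraction(scores, labels, lv)) for lv in levels]


# Toy validation data, invented for illustration.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1]
for level, share in safety_profile(scores, labels, [0.75, 0.90, 1.00]):
    print(f"safety level {level:.2f}: {share:.1%} of cases need review")
```

Under this proxy the Gray Zone share can only stay level or grow as the safety level rises, which is the trade-off the Safety Profile Curve is meant to expose.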

 

Why Similar AUC Values Can Mask Different Safety Profiles

The analysis showed that models with similar AUC values can behave very differently once safety requirements are applied. AUC alone did not capture how score distributions affected operational use. What mattered was not only discrimination, but how prediction scores were positioned across positive and negative cases.

 

In simulated examples, one model with modest AUC produced a strong Rule-in Safe Zone because scores for positive cases were concentrated at the high end. That profile suited confident case escalation. Another model with a similar AUC showed the opposite pattern, producing a broad Rule-out Safe Zone better suited to safely excluding low-risk cases. A high AUC also did not guarantee efficient automation. One high-performing model with wide and overlapping score distributions produced a large Gray Zone and high operational uncertainty. Another model with the same high AUC but narrower, better separated distributions produced a much smaller Gray Zone and more reliable automation.

 

The retrospective mammography comparison reinforced that point. AI Solution #1 achieved an AUC of 0.928, compared with 0.882 for AI Solution #2. At less stringent safety levels, AI Solution #1 reduced the Gray Zone more quickly and offered greater operational efficiency. Under the strictest rule-out requirement, that advantage reversed. At a safety level of 100%, AI Solution #2 correctly ruled out 290 true negative cases, removing 29.0% of the cohort from radiologist workload. AI Solution #1 ruled out 167 cases, or 16.7% of the cohort. The difference of 123 cases showed that the higher-AUC model was not the stronger option for maximum-confidence screening.
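
The workload figures follow from simple arithmetic on the reported counts, reproduced in the short Python check below. The cohort size of 1,000 is the dataset size reported for the evaluation; the helper name `workload_share` is hypothetical.

```python
def workload_share(ruled_out: int, cohort_size: int) -> float:
    """Fraction of the cohort removed from the radiologist worklist."""
    return ruled_out / cohort_size


COHORT_SIZE = 1000  # retrospective mammography dataset
RULED_OUT = {"AI Solution #1": 167, "AI Solution #2": 290}  # at 100% safety level

for name, count in RULED_OUT.items():
    print(f"{name}: {workload_share(count, COHORT_SIZE):.1%} of cohort ruled out")

difference = RULED_OUT["AI Solution #2"] - RULED_OUT["AI Solution #1"]
print(f"Difference: {difference} cases")
```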

 

Implications for Clinical Workflow and Model Governance

The framework also provided a way to examine human performance within AI-defined safety zones. In the mammography analysis, the greatest human-AI discordance appeared in the Rule-out Safe Zone: where the AI was most confident that cases were low risk, radiologist accuracy was lowest, driven by a high rate of false positives. In the Rule-in Safe Zone, radiologist accuracy remained high. The Gray Zone marked the region where difficulty persisted and expert review remained most important.

 

These findings position SA-ROC as more than a technical evaluation tool. It acts as a governance framework that allows safety policy to be translated into operational design. A balanced policy can set moderate reliability thresholds in both directions. A referral-focused policy can impose a very high positive predictive value requirement. A screening-focused policy can prioritise near-perfect negative predictive value. More restrictive utility settings can also be used to minimise specific harms, such as missed cancers or unnecessary positive calls.

 

Several constraints remain. The evaluation used a retrospective dataset of 1,000 mammograms with 40% disease prevalence. Threshold stability at very high safety levels will require larger real-world validation. Because the framework depends on predictive values, performance will also vary with prevalence, making broader testing across centres and settings important. Future work is planned for more complex tasks, including multiclass problems and large language model assessment.
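
The prevalence dependence follows from Bayes' rule, and a short Python sketch makes it concrete. The sensitivity and specificity values below are hypothetical and are not taken from the study; only the 40% prevalence matches the evaluation dataset, and the 1% figure is an illustrative low-prevalence setting.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a fixed operating point at a given prevalence."""
    tp = sensitivity * prevalence              # true-positive share of all cases
    fp = (1 - specificity) * (1 - prevalence)  # false-positive share
    tn = specificity * (1 - prevalence)        # true-negative share
    fn = (1 - sensitivity) * prevalence        # false-negative share
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("predictive value undefined: no cases in that group")
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical operating point: 90% sensitivity, 90% specificity.
for prevalence in (0.40, 0.01):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

With the operating point held fixed, PPV falls and NPV rises as prevalence drops, so cut-offs tuned on an enriched dataset may not hold the same safety targets in a lower-prevalence population.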

 

SA-ROC shifts clinical AI evaluation from accuracy alone to operational safety under defined reliability targets. By separating outputs into rule-in, rule-out and Gray Zone decisions, it shows when automation can be trusted and when human review must remain in place. The mammography comparison demonstrated that the model with the higher AUC was not the better option for the strictest screening objective. That result underlines the need to judge clinical AI by how it performs within care pathways, not only by aggregate statistical performance. The framework also turns uncertainty into a measurable operational burden, giving clinical teams a way to compare models, define policy and align deployment decisions with workflow demands.

 

Source: npj Digital Medicine


References:

Kim YT, Kim H, Bahl M et al. (2026) Defining operational safety in clinical artificial intelligence systems. npj Digit Med: In Press.


