Artificial intelligence has shown strong performance in mammographic screening, yet implementation remains cautious because even high-performing models can miss cancers. A key challenge is knowing when the model’s prediction is reliable enough to act on without human input. Researchers evaluated a strategy that pairs a probability of malignancy score with an explicit measure of prediction uncertainty, allowing the AI to make recall decisions only when confident and routing the rest to radiologist double reading. The aim was to test whether such a hybrid workflow could lower reading workload while maintaining cancer detection and recall rates in a national screening context.
Uncertainty-Guided Model and Reading Workflow
The team developed a noncommercial mammography interpretation pipeline that detects suspicious regions, classifies them with a ConvNeXt-tiny network, and aggregates findings into an examination-level probability of malignancy on a 1–100 scale. Alongside this score, the system generated an uncertainty estimate focused on the region classification stage, which was considered the main source of potential error at examination level given the region detector’s high sensitivity. Eight candidate uncertainty metrics were explored, derived either from Monte Carlo dropout distributions or from the probability output itself, and computed from either the most suspicious region or all regions. The dataset comprised 41,471 digital screening examinations from 15,524 women in the Dutch national programme from July 2003 to August 2018; images were acquired on Hologic-Lorad Selenia systems. Examinations had been double read with arbitration, and 2-year follow-up identified screen-detected and interval cancers. Ethical approval was waived.
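The Monte Carlo dropout idea described above can be illustrated with a minimal sketch: run repeated stochastic forward passes of the region classifier and summarise the spread of the per-region probabilities. The function and variable names here (`mc_dropout_probs`, `toy_forward`) are hypothetical, and the toy classifier simply adds noise to stand in for dropout variability; the study's actual network and metrics are not reproduced.

```python
import numpy as np

def mc_dropout_probs(forward, x, n_samples=20, rng=None):
    """Run n_samples stochastic forward passes (dropout kept active at
    inference) and stack per-region malignancy probabilities into an
    array of shape (n_samples, n_regions)."""
    if rng is None:
        rng = np.random.default_rng(0)
    return np.stack([forward(x, rng) for _ in range(n_samples)])

def toy_forward(region_scores, rng):
    """Toy stand-in for the region classifier: perturbs each region's
    base probability to mimic dropout-induced variability."""
    base = np.asarray(region_scores, dtype=float)
    return np.clip(base + rng.normal(0.0, 0.05, size=base.shape), 0.0, 1.0)

regions = [0.12, 0.81, 0.33]                  # hypothetical per-region scores
samples = mc_dropout_probs(toy_forward, regions, n_samples=50)
mean_p = samples.mean(axis=0)                 # mean probability per region
var_p = samples.var(axis=0)                   # one candidate uncertainty metric
```

From the sample matrix, candidate metrics such as per-region variance or the entropy of the mean probability can be computed over either the most suspicious region or all regions, mirroring the eight variants the authors explored.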
A hybrid strategy required two thresholds: one for probability of malignancy and one for uncertainty. If uncertainty exceeded its threshold, the examination was routed to double reading irrespective of the malignancy score. If uncertainty was below the threshold, the AI’s malignancy score alone determined recall. Thresholds were optimised on half of the dataset and tested on the remaining half, using bootstrap resampling to compare cancer detection, recall, sensitivity and specificity with standard double reading. The optimisation sought to minimise the proportion sent to radiologists while keeping cancer detection at or above, and recall at or below, the corresponding means for double reading.
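The two-threshold routing rule can be sketched as a small decision function. The function name and threshold values below are illustrative placeholders, not the study's optimised operating points.

```python
def route(pom, uncertainty, t_pom, t_unc):
    """Route one examination under the hybrid strategy.

    If uncertainty exceeds its threshold, defer to radiologist double
    reading regardless of the malignancy score; otherwise the AI's
    probability of malignancy (pom, 1-100 scale) alone decides recall.
    Returns (pathway, decision), with decision None when deferred.
    """
    if uncertainty > t_unc:
        return ("double_read", None)
    return ("ai_only", "recall" if pom >= t_pom else "no_recall")

# Illustrative thresholds only; the real values were fitted on half
# the dataset to minimise the share routed to radiologists.
print(route(pom=90, uncertainty=0.05, t_pom=50, t_unc=0.3))
```

In the study, these two thresholds were jointly optimised on the training half so that cancer detection stayed at or above, and recall at or below, the double-reading means.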
Workload Reduction Without Compromising Accuracy
Three uncertainty metrics enabled a split in which a portion of examinations could be acted on by AI alone while maintaining screening performance. The best performer was the entropy of the mean probability of malignancy for the most suspicious region. With this metric, 61.9% of examinations were referred for radiologist double reading and 38.1% were decided by AI, yielding a cancer detection rate of 6.6 per 1000 examinations and a recall rate of 23.7 per 1000. These outcomes did not differ significantly from standard double reading at 6.7 and 23.9 per 1000, respectively. Among recalled women, 19.0% would have been recalled by AI alone under this configuration.
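The best-performing metric has a simple form: take the most suspicious region's mean malignancy probability (averaged over the Monte Carlo dropout passes) and compute its binary entropy, which peaks when the probability sits at 0.5 and vanishes near 0 or 1. The sketch below uses natural-log entropy (nats); the paper's exact formulation and scaling are not specified here, so treat this as an assumed implementation.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Binary entropy in nats; eps guards against log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)))

def exam_uncertainty(region_mean_probs):
    """Entropy of the MC-dropout mean probability of the most
    suspicious region, used as the examination-level uncertainty."""
    return binary_entropy(max(region_mean_probs))

# A region near 0.5 is maximally ambiguous; one near 0.9 is confident.
uncertain_exam = exam_uncertainty([0.10, 0.52])
confident_exam = exam_uncertainty([0.10, 0.90])
```

An examination whose top region hovers near 0.5 thus gets high entropy and is routed to radiologists, while confidently benign or malignant examinations can be decided by the AI alone.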
Receiver operating characteristic analyses showed that the AI model’s discrimination depended on its certainty classification. For examinations labeled uncertain, the model achieved an area under the curve of 0.87, whereas for examinations labeled certain the area under the curve was 0.96. At matched specificity within each group, AI sensitivity was lower than radiologists’ in the uncertain group, but comparable in the certain group, where AI sensitivity of 85.4% did not differ from double reading at 88.9%. Cancer prevalence was similar in the uncertain and certain groups at 8.6 and 9.8 per 1000, respectively. Breast density patterns differed between groups, with density C or D more frequent among uncertain cases, while age at diagnosis, tumour size and cancer type showed no clear differences.
Comparators, Alternative Splits and Limitations
Standalone AI at a point optimised to maintain cancer detection comparable to double reading produced a markedly higher recall rate of 52.8 per 1000 and lower specificity of 95.4%, underscoring the value of uncertainty-based routing rather than unconditional automation. A more conventional hybrid that ignored uncertainty and sent only the highest AI scores for double reading required radiologist review of 91.1% of examinations to preserve cancer detection and recall rates similar to double reading, offering far less workload relief than the uncertainty-guided split. Only one screen-detected cancer with a confident AI prediction would have been missed by the hybrid strategy; in contrast, cases that AI would have incorrectly dismissed under AI-only reading were captured by the hybrid because their predictions were classified as uncertain and thus sent to radiologists.
Several constraints frame interpretation. Uncertainty estimation was limited to the classification stage and did not model uncertainty from region detection. The retrospective design did not capture potential changes in radiologist behaviour under a new prevalence mix, nor did it measure reading time, so a 38.1% reduction in examinations does not necessarily equate to an equivalent reduction in minutes. All data came from a single screening unit and a single vendor’s digital mammography systems, which may limit generalisability to other settings or to digital breast tomosynthesis. Nonetheless, uncertainty metrics that were simpler to compute performed at least as well as more intensive methods in separating cases into higher and lower AI performance bands, suggesting feasibility for broader architectures.
An uncertainty-aware hybrid reading workflow allowed AI to make recall decisions only when confident and referred the remainder to double reading, reducing the share of examinations requiring radiologist interpretation to 61.9% while maintaining cancer detection and recall at levels comparable to standard double reading. Performance gains concentrated in the AI-certain subset, where discrimination approached that of double reading, supporting selective automation driven by explicit uncertainty quantification. These results indicate a pragmatic pathway to ease screening workload without compromising outcomes, particularly in programmes seeking scalable, audited use of AI for examination triage and recall decisions.
Source: Radiology
Image Credit: iStock