Population-based breast cancer screening programmes have been effective in reducing mortality across Europe, largely through the detection of smaller, node-negative tumours at earlier stages. Conventional mammography screening, however, faces challenges: limited sensitivity, a shortage of radiologists and variation in diagnostic accuracy between readers. Artificial intelligence systems developed to assist in the interpretation of mammograms could enhance both efficiency and detection outcomes. A large-scale retrospective cohort study in the Netherlands evaluated whether AI could function as an independent second reader in national screening, comparing it against traditional single and double human reading.
Evaluating AI in Population-Based Screening
The Dutch screening programme, established in 1989, invites women aged 50 to 75 years for biennial mammography. Traditionally, each examination is reviewed by two radiologists, with differences resolved through consensus. This double-reading approach aims to reduce missed cancers, but it requires considerable human resources and is still associated with a proportion of cancers that emerge between rounds as interval cases. In the retrospective analysis, 42,236 consecutive mammograms taken between September 2016 and August 2018 were processed using a deep learning-based AI system. The cohort represented a real-world population attending national screening, and cancer outcomes were tracked using the Netherlands Cancer Registry for up to 52 months. This long follow-up allowed researchers to capture not only screen-detected cancers but also interval cancers and those that became apparent in subsequent rounds, offering a more complete picture of detection performance.
Four scenarios were tested. The first was single human reading, representing the interpretation of the first radiologist without consensus. The second was double human reading, incorporating both radiologists’ views. The third was stand-alone AI reading, where the system’s recall decisions were applied independently. Finally, the fourth was combined single human reading with AI, simulating a situation in which the first radiologist’s decisions were supplemented by AI as a second reader. Each scenario was assessed for sensitivity, specificity, recall rates and the number of cancers detected. Additional analyses examined whether breast density influenced performance and compared tumour characteristics of cancers detected only by AI against those detected by radiologists.
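The four scenarios can be thought of as different recall decision rules applied to the same set of examinations. The sketch below is purely illustrative, not the study's code: reader names, the toy data and the simplification of double-reading consensus to a logical OR (in practice, discordant double reads are resolved by consensus discussion) are all assumptions made for clarity.

```python
# Illustrative sketch of the four reading scenarios as recall decision rules.
# r1, r2, ai are booleans: True means that reader would recall the woman.

def recall_decision(scenario, r1, r2, ai):
    if scenario == "single_human":
        return r1                 # first radiologist only, no consensus
    if scenario == "double_human":
        return r1 or r2           # consensus step simplified to OR here
    if scenario == "ai_alone":
        return ai                 # stand-alone AI reading
    if scenario == "human_plus_ai":
        return r1 or ai           # AI as the independent second reader
    raise ValueError(scenario)

def sensitivity_specificity(decisions, has_cancer):
    """Sensitivity = recalled cancers / all cancers; specificity analogous."""
    tp = sum(d and c for d, c in zip(decisions, has_cancer))
    tn = sum(not d and not c for d, c in zip(decisions, has_cancer))
    pos = sum(has_cancer)
    neg = len(has_cancer) - pos
    return tp / pos, tn / neg

# Toy data: (reader1, reader2, AI, cancer diagnosed within follow-up)
exams = [(True, False, True, True), (False, False, True, True),
         (False, False, False, False), (True, True, False, False)]
dec = [recall_decision("human_plus_ai", r1, r2, ai) for r1, r2, ai, _ in exams]
sens, spec = sensitivity_specificity(dec, [c for *_, c in exams])
```

The OR rule makes explicit why combining readers raises sensitivity (a cancer is caught if either reader flags it) at the cost of more recalls, which is exactly the trade-off reported in the results.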
Findings on Detection and Recall
Among the women screened, 580 were eventually diagnosed with breast cancer: 291 were detected through screening, 102 presented as interval cancers, and 187 were diagnosed in subsequent screening rounds. Single human reading reached a sensitivity of 46.9% and a specificity of 97.7%, while double human reading improved sensitivity to 51.7% at the same level of specificity. Stand-alone AI, when adjusted to the recall rate of single human reading, achieved a sensitivity of 48.6% and a specificity of 97.8%, closely mirroring the performance of human readers. The most notable gains came from combining AI with single human reading: sensitivity rose to 60.2%, a relative increase of 16.4% compared with double human reading. However, specificity fell to 95.8% and the recall rate increased from 2.9% to 5.0%, referring roughly 70% more women for further investigation.
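As a quick sanity check on the reported figures (using only the percentages given above), the relative sensitivity gain and the growth in recalls work out as follows:

```python
# Arithmetic check of the reported metrics (percentages from the article).
double_sens, combined_sens = 51.7, 60.2      # double human vs human + AI
rel_increase = (combined_sens - double_sens) / double_sens
# (60.2 - 51.7) / 51.7 ≈ 0.164, i.e. the reported 16.4% relative gain

recall_single, recall_combined = 2.9, 5.0    # recall rates, %
recall_factor = recall_combined / recall_single   # ≈ 1.72-fold more recalls
```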
The additional cancers identified through AI support were clinically relevant. Compared with screen-detected cancers recognised by radiologists, the cancers AI flagged but radiologists initially missed were more often invasive and larger. At eventual diagnosis, 93% of AI-identified interval and subsequent-round cancers were invasive, compared with 67% of human-detected screen cancers. Tumour size also differed: nearly 28% of AI-identified cancers exceeded 20 mm, compared with just over 11% of human-detected cancers. Lymph node involvement was also slightly higher among AI-identified cancers, indicating that these tumours tended to be more advanced by the time they were found without AI input. These findings underline the importance of AI’s contribution, as detecting such cancers earlier could reduce morbidity and improve long-term outcomes.
Breast density is often cited as a limiting factor for mammographic sensitivity, as dense tissue can mask lesions. In this study, AI performance was not significantly affected by density categories. Both human readers and AI showed reduced sensitivity in denser breasts, but the differences between them were not statistically significant. This suggests that AI may provide similar levels of support across a range of breast tissue types, countering concerns that its utility would be restricted in dense breasts.
Clinical Relevance of AI-Detected Cancers
A major consideration in screening is whether additional detections represent genuine clinical benefit or contribute to overdiagnosis. The results showed that cancers identified by AI but missed by radiologists were not indolent but clinically significant. These tumours were invasive in the vast majority of cases, often larger at diagnosis, and more frequently involved lymph nodes. The evidence suggests that AI can highlight cancers at an earlier stage, which, if acted upon, could prevent progression to more advanced disease. Moreover, some recalls originally considered false positives when judged against human readers were later confirmed as malignant after long-term follow-up. This finding challenges assumptions about false positive rates in AI-supported screening, showing that some of these cases represent early detection rather than overdiagnosis.
Human and AI reading are therefore complementary: AI misses some cancers that radiologists identify, but detects others that humans overlook. Used together, the two approaches can enhance overall sensitivity without dramatically compromising specificity. The increased recall rate, however, is a clear challenge: more recalls mean additional diagnostic workups, greater workload and more anxiety for patients. Effective arbitration mechanisms are needed to manage discordant findings between AI and human readers; group review processes or the involvement of alternative AI systems could help ensure that gains in sensitivity are not offset by excessive false positives.
From a health system perspective, AI integration could also reduce reliance on scarce radiology resources. If AI can reliably function as one of the readers in a double-reading programme, the demand for radiologists could be reduced, freeing capacity for other clinical tasks. Yet, for safe implementation, protocols must be developed to handle disagreements between human and AI assessments, ensuring that potentially relevant cases are not dismissed prematurely.
Integrating AI as a second reader within breast cancer screening programmes improves sensitivity and detection of clinically relevant cancers compared with double human reading, independent of breast density. Although this approach increases recall rates and reduces specificity, the trade-off may be justified by the earlier detection of invasive and larger tumours. For implementation in population-based settings, effective arbitration processes are essential to manage discrepant findings and prevent unnecessary recalls. These results highlight AI’s potential to complement radiologists, reduce workload and enhance outcomes in national screening programmes.
Source: The Lancet Digital Health
Image Credit: iStock