The integration of artificial intelligence into mammographic screening workflows has the potential to enhance diagnostic accuracy, streamline clinical operations and optimise resource allocation. However, the dynamic nature of AI models, which allows them to evolve through frequent software updates, introduces new challenges in clinical governance. Ensuring that performance improvements are genuine—and not accompanied by unintended regressions—is critical for patient safety. A recent study used the Personal Performance in Mammographic Screening scheme (PERFORMS) in the UK to compare two consecutive versions of a commercial AI system. This structured comparison against over 1,200 trained human readers presents a compelling framework for monitoring algorithmic evolution in real-world screening environments.

 

Benchmarking AI with Established Quality Assurance Methods 
PERFORMS has long served as a rigorous, external quality assurance scheme within the UK’s National Health Service Breast Screening Programme (NHSBSP). It provides annual assessments of human readers using test sets of challenging mammographic cases, offering a consistent, high-quality benchmark that makes it well suited to evaluating AI performance. In this study, two versions of the same AI algorithm, Lunit Insight MMG, were tested on ten PERFORMS sets comprising 600 cases in total. Each version was evaluated independently, and results were compared with those of human readers who undertook the same tests between 2012 and 2023. The use of pathology-confirmed cases with known outcomes enabled rapid and accurate performance comparisons, providing a level of validation that is often unavailable in live clinical environments, where diagnostic outcomes may take years to confirm.

 

The results demonstrated that the newer AI version (V2) achieved an area under the receiver operating characteristic curve (AUC) of 0.94, slightly higher than the previous version (V1) at 0.93, although this marginal difference was not statistically significant. Importantly, both AI versions achieved higher specificity than human readers (V1: 87.4%, V2: 88.2% vs human average: 79.0%), indicating a lower false-positive rate. This is particularly valuable in reducing patient anxiety and unnecessary follow-up procedures. These findings support the feasibility of using quality assurance schemes such as PERFORMS to validate AI system updates quickly and robustly, without compromising clinical safety or performance.
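As a purely illustrative sketch (not the study’s own code), scores from two AI versions on a fixed test set could be compared along the following lines; the file names and operating threshold are placeholder assumptions:

```python
# Minimal sketch: compare two AI versions on the same fixed test set,
# as one might for a PERFORMS-style benchmark. File names, threshold
# and score arrays are illustrative assumptions, not the study's data.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true = np.load("labels.npy")            # 1 = pathology-confirmed cancer, 0 = normal/benign
scores_v1 = np.load("ai_v1_scores.npy")   # per-case suspicion scores from version 1
scores_v2 = np.load("ai_v2_scores.npy")   # per-case suspicion scores from version 2

def specificity(y, scores, threshold=0.5):
    """Specificity = true negatives / all actual negatives at a chosen operating point."""
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    return tn / (tn + fp)

for name, s in [("V1", scores_v1), ("V2", scores_v2)]:
    print(name,
          "AUC:", round(roc_auc_score(y_true, s), 3),
          "specificity:", round(specificity(y_true, s), 3))
```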

 


 

Evaluating Sensitivity Across Pathological and Radiological Features 
Beyond overall accuracy metrics, the study examined sensitivity: the ability of the AI and human readers to detect specific malignant lesions. Both AI versions outperformed human readers in several critical categories. Although sensitivity did not differ significantly between V1 and V2, V2 was significantly more sensitive than human readers (88.7% vs 83.2%, p = 0.04), with particular gains in detecting invasive cancers and spiculated masses, highlighting its enhanced diagnostic utility in identifying more complex or aggressive cancer types.
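For illustration only, a difference in sensitivity between an AI version and a reader group could be tested with a two-proportion z-test along these lines; the counts shown are placeholders rather than the study’s data, and the study’s actual statistical methods may differ:

```python
# Illustrative sketch of testing a sensitivity difference with a two-proportion
# z-test. The counts below are placeholder assumptions, not the study's data.
from statsmodels.stats.proportion import proportions_ztest

detected = [887, 832]    # hypothetical cancers detected by AI V2 and by human readers
total = [1000, 1000]     # hypothetical number of cancer reads in each group
stat, p_value = proportions_ztest(count=detected, nobs=total)
print(f"z = {stat:.2f}, p = {p_value:.3f}")
```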

 

Further analyses by cancer grade, subtype and lesion size revealed consistent results. AI V2 was equivalent or superior to human readers across all categories, detecting higher proportions of lesions across breast densities and mammography vendors. For instance, for lesions greater than 20 mm, AI V2 achieved 89% sensitivity compared with 84% for human readers, and it surpassed human detection of spiculated masses (96% vs 87%). These granular insights are crucial for understanding not just whether an AI system performs well overall, but how it performs under varied clinical and technical conditions. Such targeted assessments can guide informed decisions on when and where AI integration is most beneficial.
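A hypothetical sketch of how such subgroup breakdowns might be scripted is shown below; the per-case table and its column names (size_band, breast_density, vendor, ai_v2_recall) are illustrative assumptions, not fields from the study:

```python
# Sketch of a stratified sensitivity breakdown from a hypothetical per-case table.
# Sensitivity is computed only over pathology-confirmed cancer cases.
import pandas as pd

df = pd.read_csv("cases.csv")        # one row per case (hypothetical file)
cancers = df[df["cancer"] == 1]      # restrict to cancer cases

for group_col in ["size_band", "breast_density", "vendor"]:
    # ai_v2_recall is assumed to be 1 if the AI flagged the case, else 0,
    # so the group mean is the proportion of cancers detected in that subgroup.
    sens = cancers.groupby(group_col)["ai_v2_recall"].mean()
    print(f"\nSensitivity by {group_col}:\n{sens.round(2)}")
```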

 

Advantages and Limitations of Test Set-Based Monitoring 
Traditional prospective trials, though valuable, are often resource-intensive and ill-suited to monitoring incremental updates in AI software, especially in time-sensitive clinical settings. This study offers an alternative: test set-based evaluations built on a well-established quality assurance framework, which allow algorithm changes to be validated rapidly against known standards and performance regressions to be identified in a timely way. This model could be especially useful where AI updates are frequent and difficult to regulate through conventional clinical trials.
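As an illustration of the idea, a simple "regression gate" over a fixed quality assurance set might look like the sketch below; the tolerance, metric and file names are assumptions for illustration, not anything specified in the study:

```python
# Minimal sketch of a test set-based regression gate for an AI update:
# re-score the fixed QA set with the new version and flag any drop in AUC
# beyond a chosen tolerance. All values and file names are assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score

TOLERANCE = 0.01  # maximum acceptable AUC drop before the update is escalated for review

y_true = np.load("labels.npy")
auc_old = roc_auc_score(y_true, np.load("ai_v1_scores.npy"))
auc_new = roc_auc_score(y_true, np.load("ai_v2_scores.npy"))

if auc_new < auc_old - TOLERANCE:
    print(f"Regression detected: AUC fell from {auc_old:.3f} to {auc_new:.3f}")
else:
    print(f"Update passes: AUC {auc_old:.3f} -> {auc_new:.3f}")
```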

 

Nonetheless, limitations exist. The study’s retrospective design means the AI’s influence on human reporting was not assessed, as might occur during concurrent reading in real clinical settings. Furthermore, the use of cancer-enriched test sets could lead to inflated recall rates among human readers, introducing a ‘laboratory effect’. While AI performance is not affected by such contextual biases, it remains essential to account for these influences when interpreting results. Additionally, as the data stem from the NHSBSP, the findings may not be generalisable to populations with different ethnic compositions, screening intervals or clinical staffing models. 

 

The study outlines a scalable and replicable framework for evaluating AI performance in mammography using external quality assurance schemes. By comparing two software versions of the same commercial AI product against a broad cohort of trained human readers, it demonstrates that diagnostic integrity can be maintained, or even improved, across algorithm updates. The approach offers a pragmatic alternative to clinical trials for ongoing AI monitoring, ensuring patient safety while allowing innovation to progress. As AI becomes more embedded in diagnostic workflows, such models for post-deployment performance assurance will be essential for sustaining clinical trust and regulatory compliance.

 

Source: European Journal of Radiology 

Image Credit: Freepik


References:

Taib AG, James JJ, Partridge GJW et al. (2025) Keeping AI on Track: Regular monitoring of algorithmic updates in mammography. European Journal of Radiology (in press).


