The rapid evolution of generative artificial intelligence (AI) has reshaped numerous sectors, with healthcare standing out as a particularly promising domain. Advanced language and vision models have demonstrated the ability to interpret clinical data, generate human-like text and process complex medical information. These capabilities have stimulated growing interest in the diagnostic potential of generative AI systems. However, despite considerable research, a comprehensive evaluation of how generative AI compares with practising physicians in clinical diagnostic accuracy has remained limited.
A systematic review and meta-analysis published in npj Digital Medicine now offers critical insight. By analysing 83 studies across a range of medical disciplines from 2018 to 2024, this work highlights both the strengths and limitations of generative AI models. Although some models show parity with non-expert physicians, overall performance remains below expert levels. The findings underscore the emerging role of generative AI in supporting clinical workflows and medical training, albeit with important considerations regarding reliability and context.
Diagnostic Accuracy and Model Diversity
The meta-analysis drew upon a diverse array of generative AI models, with the most frequently studied being GPT-4 and GPT-3.5. These models were assessed across a spectrum of specialties, including general medicine, radiology, emergency medicine, dermatology, ophthalmology and others. The overall diagnostic accuracy across all models was found to be 52.1%, with a 95% confidence interval of 47.0% to 57.1%. This moderate level of accuracy illustrates the models’ ability to approximate medical reasoning but also highlights significant limitations.
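To make the pooled figure concrete, the sketch below shows one standard way a meta-analysis can combine per-study accuracies into a single estimate with a 95% confidence interval, using a random-effects (DerSimonian-Laird) model on logit-transformed proportions. The study counts are invented for illustration only, and the review's exact pooling method may differ.

# Minimal sketch, assuming hypothetical per-study counts; not the review's actual data or necessarily its exact method.
import numpy as np

correct = np.array([48, 130, 22, 75, 310])   # hypothetical correct diagnoses per study
total   = np.array([90, 250, 40, 150, 600])  # hypothetical number of cases per study

p = correct / total
logit = np.log(p / (1 - p))
var = 1 / correct + 1 / (total - correct)    # variance of each logit-transformed accuracy

# Fixed-effect weights and between-study heterogeneity (DerSimonian-Laird)
w = 1 / var
q = np.sum(w * (logit - np.sum(w * logit) / np.sum(w)) ** 2)
df = len(p) - 1
tau2 = max(0.0, (q - df) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))

# Random-effects pooled estimate and 95% CI, back-transformed to a proportion
w_re = 1 / (var + tau2)
pooled_logit = np.sum(w_re * logit) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
to_prop = lambda x: 1 / (1 + np.exp(-x))
print(f"pooled accuracy ~ {to_prop(pooled_logit):.1%} "
      f"(95% CI {to_prop(pooled_logit - 1.96 * se):.1%} to {to_prop(pooled_logit + 1.96 * se):.1%})")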
When the models were compared directly with physicians, no significant difference emerged overall (p = 0.10), nor in the comparison with non-expert physicians (p = 0.93). However, generative AI models performed significantly worse than expert physicians, with a difference in accuracy of 15.8% (p = 0.007). Some individual models, such as GPT-4o, Claude 3 Sonnet, Llama 3 70B, Gemini 1.5 Pro and Perplexity, performed slightly better than non-expert physicians, although these differences were not statistically significant. These findings suggest that while the capabilities of generative AI are advancing, consistency and precision remain challenges, particularly when benchmarked against clinical expertise.
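For readers less familiar with such comparisons, the short sketch below illustrates how an accuracy gap of roughly sixteen percentage points between two groups can be tested for significance with a simple two-proportion z-test. The counts are hypothetical, and the review's own comparison relied on its meta-analytic framework rather than this simplified test.

# Minimal sketch with invented counts; illustrative only, not the review's analysis.
import numpy as np
from scipy.stats import norm

correct = np.array([156, 203])   # hypothetical correct diagnoses: AI (~52%), expert physicians (~68%)
total   = np.array([300, 300])   # hypothetical number of cases per group

p = correct / total
p_pool = correct.sum() / total.sum()                      # pooled proportion under the null hypothesis
se = np.sqrt(p_pool * (1 - p_pool) * (1 / total).sum())   # standard error of the difference
z = (p[1] - p[0]) / se
p_value = 2 * norm.sf(abs(z))                             # two-sided p-value

print(f"accuracy gap = {p[1] - p[0]:.1%}, z = {z:.2f}, p = {p_value:.4f}")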
Clinical Applicability and Specialisation
The diagnostic tasks evaluated in the study spanned general and specialist medicine. Most medical fields showed no significant difference in performance, but dermatology and urology emerged as exceptions, with p-values below 0.001. The enhanced performance of AI in dermatology may be attributable to the visual nature of the field, where pattern recognition, a task at which AI excels, plays a major role. However, dermatology also involves nuanced clinical reasoning and individualised patient assessment, limiting the extent to which AI can replace human judgement.
The urology findings were based on a single large-scale study, which may limit the broader applicability of its conclusions. Most studies assessed AI on datasets described as external, although 25 studies could not clarify whether their test data were truly external, since the origins of the models' training data are unknown. This gap raises concerns about the robustness of the evaluation. Despite some models showing potential in specific contexts, generative AI systems have not yet demonstrated consistent diagnostic reliability across disciplines. Their value may be greatest in settings where resources are limited or where preliminary diagnostic support could improve the efficiency of patient care.
Educational Utility and Methodological Constraints
In addition to clinical deployment, generative AI shows promise in medical education. The observed parity between AI and non-expert physicians indicates potential for use in training environments. AI models could assist students and junior clinicians by simulating diagnostic scenarios, providing case variety and supporting reflective learning. However, methodological concerns in the analysed studies warrant caution.
According to the review's quality assessment with the PROBAST tool, 76% of the included studies were at high risk of bias, primarily because of small sample sizes or a lack of transparency about training data; the remaining 24% were rated at low risk of bias. When only the low-risk subgroup was analysed, the results remained largely consistent, suggesting that the overall findings are not substantially affected by the risk of bias. Nevertheless, the absence of demographic information in many studies limits understanding of how AI performance may vary across populations and healthcare contexts.
Without transparency in data sources, it is difficult to determine whether models have been evaluated on truly novel datasets. This limits confidence in their generalisability and real-world applicability. Transparency is therefore essential, both for ensuring scientific integrity and for fostering trust in AI-assisted diagnostics.
This review provides a detailed picture of generative AI's current role in clinical diagnostics. Although models such as GPT-4, Gemini 1.5 Pro, Claude 3 Opus and others demonstrate potential, the overall diagnostic accuracy of roughly 52% underscores that they are not yet suitable replacements for expert clinicians. However, their performance is often on par with that of non-expert physicians, suggesting utility as decision-support tools and educational resources.
The findings highlight the importance of model specialisation, transparency in training data and rigorous evaluation. Moving forward, the development of more accurate and accountable generative AI systems will be crucial for their successful integration into healthcare. These tools may enhance medical practice and education, provided their limitations are acknowledged and addressed through continued research and thoughtful implementation.
Source: npj Digital Medicine
Image Credit: iStock