The integration of artificial intelligence (AI) into clinical diagnosis has reached a pivotal stage with the emergence of large language models (LLMs). These generative AI tools have demonstrated considerable potential in various medical tasks, including diagnosis, but questions remain about how they compare to traditional expert systems. A recent study examined this question by evaluating two widely used LLMs—ChatGPT-4 and Gemini 1.5—against a long-established diagnostic decision support system (DDSS) known as DXplain. Using thirty-six unpublished, diagnostically complex clinical cases from three academic centres, the study sought to determine whether the correct diagnosis appeared in each system’s differential diagnosis and where it ranked among the top 25 suggestions. The investigation offers insight into the respective strengths and limitations of each system, particularly when used with or without laboratory test data.
Study Design and Methodology
The study was conducted between October 2023 and November 2024 using clinical cases that had never been published or used in system training. Each case was prepared rigorously: three physicians reviewed it to extract all relevant clinical findings. These findings were then entered into the DDSS and both LLMs under several conditions: with and without laboratory results, and using either all identified findings or only those deemed diagnostically relevant. The DDSS required the findings to be mapped to its controlled vocabulary, while the LLMs received a standard narrative prompt and were asked to provide a rank-ordered list of at least 25 potential diagnoses.
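To make the prompting step concrete, the sketch below shows one way such a narrative prompt could be assembled and sent to a chat-based LLM. It is an illustration only, assuming the OpenAI Python client; the prompt wording, model name and toy findings shown here are not taken from the study.

    # Hypothetical sketch of the LLM prompting step; the study's actual prompt
    # wording and tooling are not reproduced here.
    from openai import OpenAI  # assumes the official OpenAI Python client

    def build_prompt(findings: list[str], labs: list[str] | None = None) -> str:
        """Assemble a narrative case description and request a ranked differential."""
        text = "Clinical findings: " + "; ".join(findings) + "."
        if labs:
            text += " Laboratory results: " + "; ".join(labs) + "."
        return (text + " Based on this presentation, list at least 25 possible "
                       "diagnoses, rank-ordered from most to least likely.")

    client = OpenAI()  # requires an API key in the environment
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": build_prompt(
            ["fever", "weight loss", "night sweats"],  # toy findings, not from the study
            ["elevated lactate dehydrogenase"])}],
    )
    print(response.choices[0].message.content)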
To evaluate performance, the investigators recorded whether the known diagnosis appeared in each system’s top 25 suggestions and, if so, where it ranked, scoring the rank on a quintile basis: the higher the correct diagnosis was placed, the more points a system received. Statistical tests, including mixed-model analysis of variance and generalised estimating equations, were used to compare outcomes, although no differences reached formal statistical significance.
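The article does not give the exact point values, so the short sketch below assumes one natural reading of a quintile scheme over a top-25 list: ranks 1 to 5 earn 5 points, 6 to 10 earn 4, and so on down to 1 point for ranks 21 to 25, with 0 for a missed diagnosis.

    def quintile_score(rank: int | None) -> int:
        """Assumed quintile scoring over a top-25 differential:
        ranks 1-5 -> 5 points, 6-10 -> 4, ..., 21-25 -> 1, unlisted -> 0.
        The study's exact point values are not stated in the article."""
        if rank is None or rank > 25:
            return 0
        return 5 - (rank - 1) // 5

    # A diagnosis listed 3rd scores 5, 12th scores 3, and an unlisted one scores 0.
    print(quintile_score(3), quintile_score(12), quintile_score(None))  # 5 3 0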
Findings and Performance Comparison
Results showed that all three systems performed better when laboratory test data were included. The DDSS listed the correct diagnosis in 72% of cases in the version using all clinical findings and lab results, compared to 64% for ChatGPT-4 and 58% for Gemini. Without lab data, performance declined across all systems, with the DDSS still leading by including the correct diagnosis in 56% of cases, compared to 42% for ChatGPT-4 and 39% for Gemini.
The DDSS also tended to place the correct diagnosis higher in its list than the LLMs did. In cases where the LLMs failed to include the diagnosis, the DDSS still captured it 58% to 64% of the time. However, the reverse also held to some extent: the LLMs jointly identified the correct diagnosis in 44% of the cases that the DDSS missed. This indicates that while the DDSS may be more consistent, LLMs still offer valuable diagnostic suggestions in a non-negligible number of instances.
A further analysis using a “winner take all” method, in which the system that ranked the correct diagnosis higher gained a point, also leaned in favour of the DDSS, although these findings again did not reach statistical significance. Notably, variations in DDSS input (all findings vs relevant findings) and the inclusion of lab data had a clear impact on its performance, with all findings plus lab results offering the strongest configuration.
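A minimal sketch of that “winner take all” tally is shown below, assuming a point goes to whichever system ranks the correct diagnosis higher in a given case; how ties or double misses were handled is not stated in the article, so no point is awarded here in those situations.

    def winner_take_all(rank_a: int | None, rank_b: int | None) -> str | None:
        """Return which system wins a case: the one ranking the correct
        diagnosis higher (closer to 1). None means a tie or neither listed it.
        Tie-handling here is an assumption, not taken from the study."""
        if rank_a is None and rank_b is None:
            return None
        if rank_b is None or (rank_a is not None and rank_a < rank_b):
            return "A"
        if rank_a is None or rank_b < rank_a:
            return "B"
        return None  # equal ranks

    # Tally over hypothetical per-case ranks for two systems.
    cases = [(2, 7), (None, 4), (5, 5)]
    wins = {"A": 0, "B": 0}
    for ra, rb in cases:
        w = winner_take_all(ra, rb)
        if w:
            wins[w] += 1
    print(wins)  # {'A': 1, 'B': 1}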
System Characteristics and Implications
The comparative strengths of the systems stem from their different architectures. The DDSS has a knowledge base of over 2,600 disease profiles and 6,100 clinical findings, making it robust and deterministic. It is also transparent, with built-in explanation tools that support clinician understanding and trust. These features are especially important given that clinicians tend to favour tools that offer clear reasoning behind diagnostic suggestions.
Conversely, LLMs such as ChatGPT and Gemini are generalist tools capable of processing unstructured narrative input and generating readable, human-like responses. They require less manual input and can be updated automatically. However, their diagnostic reasoning is opaque, and their outputs may lack consistency, as they can vary with identical inputs. Moreover, LLMs are known to occasionally generate false information, which reduces trust in clinical contexts.
Despite these limitations, LLMs showed strong performance relative to their non-specialist design and, importantly, they captured diagnoses missed by the DDSS. This suggests that each system compensates for the other’s limitations. For example, the DDSS’s quasiprobabilistic algorithm sometimes favours common diseases when presented with nonspecific symptoms, potentially missing rare but correct diagnoses. In contrast, LLMs may provide broader differentials that include these rarer possibilities.
The comparison of an expert DDSS and two leading LLMs on complex, unpublished clinical cases highlights complementary strengths and weaknesses. While the DDSS demonstrated superior consistency and ranking accuracy, LLMs offered valuable diagnostic insights that the expert system sometimes missed. Given these findings, a hybrid approach could provide a more comprehensive and accurate diagnostic support system. By combining the structured, explainable logic of DDSS tools with the flexible, text-driven capabilities of LLMs, such integration may enhance clinician decision-making and ultimately improve patient outcomes. Continued refinement of both technologies and further research on their combined use in clinical settings will be essential in realising this potential.
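Purely as an illustration of what such a hybrid might look like at the output level, and not a method described in the study, the sketch below merges two ranked differentials by keeping each diagnosis’s best rank from either source, so a candidate missed by one system is still surfaced if the other lists it.

    def merge_differentials(ddss: list[str], llm: list[str], top_n: int = 25) -> list[str]:
        """Hypothetical merge of two ranked differential lists: each diagnosis
        keeps its best (lowest) rank from either source, and the combined list
        is re-sorted. This is an illustration, not the study's method."""
        best_rank: dict[str, int] = {}
        for source in (ddss, llm):
            for rank, dx in enumerate(source, start=1):
                best_rank[dx] = min(rank, best_rank.get(dx, rank))
        return sorted(best_rank, key=best_rank.get)[:top_n]

    # Toy example: each list contributes a diagnosis the other misses.
    print(merge_differentials(["sarcoidosis", "lymphoma", "tuberculosis"],
                              ["lymphoma", "Whipple disease"]))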
Source: JAMA Network Open
Image Credit: iStock