The integration of artificial intelligence (AI) into clinical diagnosis has reached a pivotal stage with the emergence of large language models (LLMs). These generative AI tools have demonstrated considerable potential in various medical tasks, including diagnosis, but questions remain about how they compare to traditional expert systems. A recent study examined this question by evaluating two widely used LLMs—ChatGPT-4 and Gemini 1.5—against a long-established diagnostic decision support system (DDSS) known as DXplain. Using thirty-six unpublished, diagnostically complex clinical cases from three academic centres, the study sought to determine whether the correct diagnosis appeared in each system’s differential diagnosis and where it ranked among the top 25 suggestions. The investigation offers insight into the respective strengths and limitations of each system, particularly when used with or without laboratory test data.
Study Design and Methodology
The study was conducted between October 2023 and November 2024 using clinical cases that had never been published or used in system training. Each case was prepared rigorously: three physicians reviewed it to extract all relevant clinical findings. These findings were then entered into the DDSS and both LLMs under several conditions: with and without laboratory results, and using either all identified findings or only those deemed diagnostically relevant. The DDSS required the findings to be mapped to its controlled vocabulary, while the LLMs received a standard narrative prompt and were asked to provide a rank-ordered list of at least 25 potential diagnoses.
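To make the prompting step concrete, the sketch below shows one way such a narrative prompt could be assembled and sent to a chat-based LLM. It is an illustration only, assuming the OpenAI Python client; the prompt wording, model name and toy findings shown here are not taken from the study.

    # Hypothetical sketch of the LLM prompting step; the study's actual prompt
    # wording and tooling are not reproduced here.
    from openai import OpenAI  # assumes the official OpenAI Python client

    def build_prompt(findings: list[str], labs: list[str] | None = None) -> str:
        """Assemble a narrative case description and request a ranked differential."""
        text = "Clinical findings: " + "; ".join(findings) + "."
        if labs:
            text += " Laboratory results: " + "; ".join(labs) + "."
        return (text + " Based on this presentation, list at least 25 possible "
                       "diagnoses, rank-ordered from most to least likely.")

    client = OpenAI()  # requires an API key in the environment
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": build_prompt(
            ["fever", "weight loss", "night sweats"],  # toy findings, not from the study
            ["elevated lactate dehydrogenase"])}],
    )
    print(response.choices[0].message.content)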
To evaluate performance, the investigators recorded whether the known diagnosis appeared in each system’s top 25 suggestions and, if so, where it ranked, scoring the rank on a quintile basis: the higher the correct diagnosis was placed, the more points a system received. Statistical tests, including mixed-model analysis of variance and generalised estimating equations, were used to compare outcomes, although no differences reached formal statistical significance.
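The article does not give the exact point values, so the short sketch below assumes one natural reading of a quintile scheme over a top-25 list: ranks 1 to 5 earn 5 points, 6 to 10 earn 4, and so on down to 1 point for ranks 21 to 25, with 0 for a missed diagnosis.

    def quintile_score(rank: int | None) -> int:
        """Assumed quintile scoring over a top-25 differential:
        ranks 1-5 -> 5 points, 6-10 -> 4, ..., 21-25 -> 1, unlisted -> 0.
        The study's exact point values are not stated in the article."""
        if rank is None or rank > 25:
            return 0
        return 5 - (rank - 1) // 5

    # A diagnosis listed 3rd scores 5, 12th scores 3, and an unlisted one scores 0.
    print(quintile_score(3), quintile_score(12), quintile_score(None))  # 5 3 0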
Findings and Performance Comparison
Results showed that all three systems performed better when laboratory test data were included. The DDSS listed the correct diagnosis in 72% of cases in the version using all clinical findings and lab results, compared to 64% for ChatGPT-4 and 58% for Gemini. Without lab data, performance declined across all systems, with the DDSS still leading by including the correct diagnosis in 56% of cases, compared to 42% for ChatGPT-4 and 39% for Gemini.
The DDSS also tended to place the correct diagnosis higher in its list than the LLMs did. In cases where the LLMs failed to include the diagnosis, the DDSS still captured it 58% to 64% of the time. However, the reverse also held to some extent: the LLMs jointly identified the correct diagnosis in 44% of the cases that the DDSS missed. This indicates that while the DDSS may be more consistent, LLMs still offer valuable diagnostic suggestions in a non-negligible number of instances.
A further analysis using a “winner take all” method, in which the system that ranked the correct diagnosis higher gained a point, also leaned in favour of the DDSS, although these findings again did not reach statistical significance. Notably, variations in DDSS input (all findings vs relevant findings) and the inclusion of lab data had a clear impact on its performance, with all findings plus lab results offering the strongest configuration.
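A minimal sketch of that “winner take all” tally is shown below, assuming a point goes to whichever system ranks the correct diagnosis higher in a given case; how ties or double misses were handled is not stated in the article, so no point is awarded here in those situations.

    def winner_take_all(rank_a: int | None, rank_b: int | None) -> str | None:
        """Return which system wins a case: the one ranking the correct
        diagnosis higher (closer to 1). None means a tie or neither listed it.
        Tie-handling here is an assumption, not taken from the study."""
        if rank_a is None and rank_b is None:
            return None
        if rank_b is None or (rank_a is not None and rank_a < rank_b):
            return "A"
        if rank_a is None or rank_b < rank_a:
            return "B"
        return None  # equal ranks

    # Tally over hypothetical per-case ranks for two systems.
    cases = [(2, 7), (None, 4), (5, 5)]
    wins = {"A": 0, "B": 0}
    for ra, rb in cases:
        w = winner_take_all(ra, rb)
        if w:
            wins[w] += 1
    print(wins)  # {'A': 1, 'B': 1}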
System Characteristics and Implications
The comparative strengths of the systems stem from their different architectures. The DDSS has a knowledge base of over 2,600 disease profiles and 6,100 clinical findings, making it robust and deterministic. It is also transparent, with built-in explanation tools that support clinician understanding and trust. These features are especially important given that clinicians tend to favour tools that offer clear reasoning behind diagnostic suggestions.
Conversely, LLMs such as ChatGPT and Gemini are generalist tools capable of processing unstructured narrative input and generating readable, human-like responses. They require less manual input and can be updated automatically. However, their diagnostic reasoning is opaque, and their outputs may lack consistency, as they can vary with identical inputs. Moreover, LLMs are known to occasionally generate false information, which reduces trust in clinical contexts.
Despite these limitations, LLMs showed strong performance relative to their non-specialist design and, importantly, they captured diagnoses missed by the DDSS. This suggests that each system compensates for the other’s limitations. For example, the DDSS’s quasiprobabilistic algorithm sometimes favours common diseases when presented with nonspecific symptoms, potentially missing rare but correct diagnoses. In contrast, LLMs may provide broader differentials that include these rarer possibilities.
The comparison of an expert DDSS and two leading LLMs on complex, unpublished clinical cases highlights complementary strengths and weaknesses. While the DDSS demonstrated superior consistency and ranking accuracy, LLMs offered valuable diagnostic insights that the expert system sometimes missed. Given these findings, a hybrid approach could provide a more comprehensive and accurate diagnostic support system. By combining the structured, explainable logic of DDSS tools with the flexible, text-driven capabilities of LLMs, such integration may enhance clinician decision-making and ultimately improve patient outcomes. Continued refinement of both technologies and further research on their combined use in clinical settings will be essential in realising this potential.
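Purely as an illustration of what such a hybrid might look like at the output level, and not a method described in the study, the sketch below merges two ranked differentials by keeping each diagnosis’s best rank from either source, so a candidate missed by one system is still surfaced if the other lists it.

    def merge_differentials(ddss: list[str], llm: list[str], top_n: int = 25) -> list[str]:
        """Hypothetical merge of two ranked differential lists: each diagnosis
        keeps its best (lowest) rank from either source, and the combined list
        is re-sorted. This is an illustration, not the study's method."""
        best_rank: dict[str, int] = {}
        for source in (ddss, llm):
            for rank, dx in enumerate(source, start=1):
                best_rank[dx] = min(rank, best_rank.get(dx, rank))
        return sorted(best_rank, key=best_rank.get)[:top_n]

    # Toy example: each list contributes a diagnosis the other misses.
    print(merge_differentials(["sarcoidosis", "lymphoma", "tuberculosis"],
                              ["lymphoma", "Whipple disease"]))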
Source: JAMA Network Open
Image Credit: iStock