The integration of artificial intelligence into healthcare has gained traction, particularly in diagnostics. Large language models (LLMs) such as Claude 3.5 Sonnet and GPT-4o have shown potential in diagnosing complex cases presenting with gastrointestinal (GI) symptoms. A recent study published in npj Digital Medicine compared multiple LLMs against experienced gastroenterologists, offering valuable insights into AI's role in clinical decision-making, the diagnostic capabilities of LLMs, their advantages over traditional methods and the challenges that must be addressed before widespread adoption.

 

The Diagnostic Performance of LLMs

LLMs have exhibited remarkable diagnostic accuracy, surpassing many human physicians in complex GI cases. In a comparative study, Claude 3.5 Sonnet demonstrated the highest diagnostic coverage at 76.1%, significantly outperforming the physicians' average coverage rate of 29.5%. Claude 3 Opus followed at 66.4%, with GPT-4o at 64.2%. Compared to human physicians who relied on conventional search engines and medical literature, LLMs offered broader differential diagnoses with greater efficiency. These models processed cases with speed and consistency, reducing the likelihood of the human biases and cognitive overload that often occur in medical practice. Additionally, they provided a structured approach to diagnosis, considering multiple possible conditions rather than fixating on a single possibility too early in the diagnostic process.

 

While experienced gastroenterologists excel in pattern recognition and clinical intuition, LLMs contribute by rapidly synthesising vast medical knowledge. GPT-4o and Claude 3 Opus, for instance, demonstrated accuracy rates of 42.9% and 44.4%, respectively, significantly higher than the 24.3% accuracy observed among the physicians. These models can recall uncommon conditions and assess complex cases without fatigue, a crucial advantage when diagnosing challenging or rare GI diseases. The ability of LLMs to integrate multiple data sources and analyse patient histories comprehensively enhances their value as diagnostic tools. However, despite these strengths, physicians remain irreplaceable due to their capacity for nuanced decision-making, contextual understanding and patient interaction.

 

Advantages of LLMs in Clinical Decision-Making

Beyond accuracy, LLMs provide several practical benefits. They significantly reduce diagnostic time, allowing physicians to focus on patient management and treatment. While physicians spent an average of 6.6 minutes per diagnosis, LLMs completed their diagnostic tasks in just 0.19 minutes (roughly 11 seconds). They also offer cost-effective diagnostic support, as their usage fees remain lower than specialist consultations: the cost per query for Claude 3.5 Sonnet was approximately $0.0104, compared with a typical outpatient visit charge of $3 to $30. The ability of LLMs to suggest differential diagnoses expands the range of possibilities that physicians might consider, particularly when symptoms are atypical or overlap with multiple conditions. Furthermore, LLMs can serve as decision-support tools, supplementing physicians' expertise rather than replacing it.

 

Another key advantage of LLMs is their accessibility. Medical professionals can utilise these tools without needing immediate access to extensive medical literature or specialist opinions, making them particularly valuable in under-resourced healthcare settings. The consistency of LLMs also ensures that they apply the same rigorous methodology across all cases, unlike human physicians, who may be influenced by external factors such as fatigue or cognitive biases.

 

However, despite these advantages, LLMs should be integrated into clinical workflows with caution. Their role should be viewed as complementary rather than substitutive, aiding rather than replacing the expertise of trained physicians. Effective implementation requires appropriate training for medical professionals to interpret AI-generated diagnoses critically, ensuring that human oversight remains a fundamental component of medical decision-making.

 

Challenges and Limitations

Despite their promise, LLMs face several challenges that must be addressed before safe and effective clinical integration. One significant issue is hallucination, where a model generates inaccurate or misleading information. While Claude 3.5 Sonnet exhibited the lowest hallucination rate at 21.3%, models such as Gemini 1.5 Pro showed significantly higher rates, reaching 62.7%. Additionally, while LLMs demonstrate proficiency in knowledge retrieval, they may misinterpret key clinical clues or lack contextual understanding. Concerns regarding patient data privacy and ethical considerations also pose barriers to implementation. Regulatory frameworks and robust validation studies are needed to ensure that AI-driven diagnostics align with existing medical standards.

 

Moreover, while LLMs excel in knowledge recall, they do not possess reasoning abilities equivalent to human clinicians. This limitation means that AI-generated diagnoses should not be accepted uncritically, as they may lack the depth of understanding that comes with years of clinical experience. Physicians must remain vigilant, verifying the accuracy of AI-generated suggestions and using their own judgement when making final decisions.

 

Another critical consideration is data security. The use of AI in diagnostics necessitates stringent measures to protect patient confidentiality and ensure compliance with healthcare regulations. Without appropriate safeguards, AI tools risk exacerbating concerns over data breaches and unauthorised access to sensitive medical information. Addressing these challenges will be vital in ensuring that AI-based diagnostic support is both safe and effective for clinical use.

 

LLMs represent a transformative advancement in medical diagnostics, particularly for challenging GI cases. Their ability to generate differential diagnoses, enhance decision-making and reduce diagnostic inefficiencies makes them valuable tools for physicians. However, their limitations necessitate cautious implementation, with human oversight remaining essential. By addressing the challenges associated with AI in healthcare, LLMs can evolve into indispensable assets that improve diagnostic accuracy and patient outcomes. Their role in the medical field is promising, but their successful integration will depend on rigorous validation, responsible implementation and continuous collaboration between AI developers and healthcare professionals.

 

Source: npj Digital Medicine



References:

Yang X, Li T, Wang H, et al. (2025) Multiple large language models versus experienced physicians in diagnosing challenging cases with gastrointestinal symptoms. npj Digital Medicine 8:85.


