Artificial intelligence continues to advance in healthcare, offering new tools to support clinical reasoning, diagnostics and education. Among these innovations, large language models (LLMs) and vision-language models (VLMs) have shown growing potential to assist clinicians in interpreting complex medical information. However, their true effectiveness in specialised domains such as gastroenterology remains uncertain. Gastroenterology requires the integration of diverse data sources – text, images and laboratory findings – making it a demanding environment for automated reasoning. To assess model capability in this field, a benchmarking analysis compared proprietary and open-source systems using standardised, board-style multiple-choice questions. The evaluation explored model accuracy, efficiency and ability to interpret image-based content, providing insight into how far these tools have progressed toward reliable clinical support.


Accuracy and Configuration of Language Models

Performance varied considerably across the 36 evaluated models. Proprietary LLMs achieved the highest scores, with the leading system reaching 82% accuracy and surpassing the average human score of 74%. Other commercial models recorded results around 74%, while the best open-source systems achieved between 61% and 66%. Smaller, quantised open-source models, which can run on consumer-grade hardware, achieved around 50% accuracy while keeping resource requirements low. Model configuration played a key role in determining performance: adjusting prompts, enforcing structured outputs and tuning inference parameters increased accuracy by as much as 10%. Optimised prompts that required models to justify their answers and reason like experts produced the most consistent results across environments, and models run through web interfaces performed comparably to those accessed through application programming interfaces (APIs) when the same configurations were applied.
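To make the kind of configuration the evaluation describes concrete, the sketch below shows a minimal board-style MCQ harness in Python. It assumes the OpenAI Python client; the model name, prompt wording and JSON shape are illustrative choices, not the study's exact setup.

    # Minimal board-style MCQ harness. Assumptions: the OpenAI Python client
    # is installed and OPENAI_API_KEY is set; the model name, prompt wording
    # and JSON shape are illustrative, not the study's exact configuration.
    import json
    from openai import OpenAI

    client = OpenAI()

    SYSTEM_PROMPT = (
        "You are a board-certified gastroenterologist. For each multiple-choice "
        "question, reason step by step like an expert, then commit to one option. "
        'Reply only with JSON: {"justification": "...", "answer": "A"}'
    )

    def ask(question: str, options: dict, model: str = "gpt-4o") -> str:
        """Return the model's chosen option letter for one question."""
        options_text = "\n".join(f"{k}) {v}" for k, v in options.items())
        response = client.chat.completions.create(
            model=model,
            temperature=0,  # deterministic decoding for repeatable benchmarking
            response_format={"type": "json_object"},  # enforce structured output
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"{question}\n{options_text}"},
            ],
        )
        return json.loads(response.choices[0].message.content)["answer"]

    def accuracy(items: list) -> float:
        """Fraction of questions answered correctly over a labelled set."""
        hits = sum(ask(q["question"], q["options"]) == q["answer"] for q in items)
        return hits / len(items)

Requiring a justification before the final answer letter mirrors the expert-reasoning prompts that produced the most consistent results in the evaluation.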



Model Compression and Performance Trade-Offs

Quantisation – reducing the numerical precision of model weights to lower computational costs – was identified as a practical way to make large models accessible without major losses in reasoning ability. Compressing models from 32-bit to 8-bit or 6-bit precision reduced memory demands substantially, allowing them to run locally on systems with only 16 GB of memory. Most quantised models retained performance comparable to their full-precision versions, confirming that the approach enables privacy-preserving, cost-efficient deployment; only one model showed a marked decrease in accuracy after quantisation. Notably, models trained specifically on medical data did not outperform general-purpose models, indicating that narrow fine-tuning alone does not ensure better reasoning. Larger quantised models often outperformed smaller full-precision ones while maintaining efficient memory use, suggesting that compression offers a balanced path toward practical deployment in clinical or educational settings. Nevertheless, quantisation introduces potential variability, underscoring the need for careful validation before these models are applied in healthcare.
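As one concrete illustration of local deployment, the sketch below loads a 6-bit quantised open-source checkpoint with llama-cpp-python. The library choice, file name and parameters are assumptions for illustration, not the study's setup.

    # Running a quantised open-source model locally. Assumptions: the
    # llama-cpp-python package and a downloaded GGUF checkpoint; the file
    # name and parameters below are illustrative, not the study's setup.
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-3-8b-instruct.Q6_K.gguf",  # 6-bit quantisation
        n_ctx=4096,     # context window large enough for full question stems
        n_threads=8,    # CPU threads; fits consumer hardware with ~16 GB RAM
        verbose=False,
    )

    result = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": "You are a board-certified gastroenterologist."},
            {"role": "user",
             "content": "Which finding best supports the diagnosis?"},  # placeholder MCQ stem
        ],
        temperature=0,
    )
    print(result["choices"][0]["message"]["content"])

Because inference runs entirely on local hardware, no patient-related text leaves the machine, which is the privacy advantage the evaluation highlights.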


Vision-Language Model Limitations

VLMs, which combine text and image processing, were also assessed to determine whether they could interpret clinical images effectively. Performance on image-based questions revealed clear limitations. Providing images directly to the models rarely improved accuracy and, in some cases, reduced it. Only a few systems, including Claude-3-Sonnet and Llama3.2-11b, showed modest gains when images were added. When human-generated image captions were supplied instead, accuracy improved markedly – by up to 29% in some cases – suggesting that models benefited from human interpretation of visual data. Automatically generated captions from the models themselves did not yield similar improvements and sometimes decreased accuracy.
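The two input modes the evaluation compared can be sketched side by side: one function attaches the raw image, the other substitutes a human-written caption. The OpenAI Python client, model name and prompt wording are assumptions for illustration.

    # Two input modes for image-based questions: raw image versus a
    # human-written caption. Assumptions: the OpenAI Python client; the
    # model name and prompt wording are illustrative only.
    import base64
    from openai import OpenAI

    client = OpenAI()

    def answer_with_image(question: str, image_path: str) -> str:
        """Mode 1: attach the clinical image directly to the request."""
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

    def answer_with_caption(question: str, caption: str) -> str:
        """Mode 2: replace the image with a human-written description."""
        prompt = f"Image description: {caption}\n\n{question}"
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

In the reported results, the second mode was the stronger one: human captions raised accuracy by up to 29%, whereas raw images rarely helped.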


These findings indicate that current VLMs struggle with visual reasoning and fail to process fine clinical details effectively. The issue may stem from the lack of specialised medical imagery in their training data and the loss of critical resolution when images are downscaled during processing. Accuracy also varied across image types: performance was stronger for endoscopic and radiological images and weaker for histological and manometric ones. This inconsistency highlights the early stage of multimodal reasoning in medicine and the need for improved architectures capable of handling complex visual data.


Benchmarking proprietary and open-source models for gastroenterology reasoning revealed substantial advances in text-based performance but continued shortcomings in visual interpretation. Proprietary systems achieved the highest accuracy, yet open-source models are narrowing the gap, offering increasingly competitive results and the potential for secure, locally deployed use. Quantisation proved effective in reducing computational demands without significantly sacrificing accuracy, while fine-tuned medical models failed to outperform broader, general-purpose systems. The limited success of VLMs underscores the challenges of applying AI to diagnostic imaging, where visual comprehension remains inadequate for clinical use. These findings highlight that LLMs can already serve as valuable tools for structured decision support and education, but their outputs must be applied cautiously. Integrating these systems safely into practice will depend on continuous evaluation, improved model transparency and standardised benchmarks that assess not only accuracy but also consistency, reliability and interpretability.


Source: npj Digital Medicine

Image Credit: iStock


References:

Safavi-Naini SAA, Ali S, Shahab O et al. (2025) Benchmarking proprietary and open-source language and vision-language models for gastroenterology clinical reasoning. npj Digit Med: In Press.


