Large Language Models (LLMs) have shown great promise in healthcare applications, from assisting with research to supporting clinical decision-making. However, their tendency to generate misleading or fabricated responses—commonly referred to as "hallucinations"—raises concerns about their reliability. While these models can accurately answer questions with well-established answers, they struggle with ambiguous topics, sometimes presenting false information with great confidence. This issue makes it crucial for healthcare professionals to critically evaluate the accuracy and trustworthiness of the LLMs they use. To address these concerns, researchers have conducted comparative analyses to evaluate the accuracy, diagnostic capabilities and trustworthiness of different LLMs, providing insights into which models perform best in different healthcare applications.
Evaluating Hallucination Rates
A key concern in using LLMs for healthcare is their susceptibility to hallucinations, which occur when a model generates incorrect or misleading information. A study published in Nature compared the hallucination rates of several LLMs, highlighting significant performance differences among them. The worst offenders included Technology Innovation Institute Falcon 7B-instruct (29.9%), Google Gemma 1.1-2B-it (27.8%) and Qwen 2.5-0.5B-instruct (25.2%). These models frequently provided inaccurate responses, which could be particularly dangerous in medical settings, where incorrect information can lead to misdiagnoses or inappropriate treatment recommendations.
Conversely, the most reliable models included OpenAI GPT-4 (1.8%), OpenAI o1-mini (1.4%), Zhipu AI GLM-4-9B-Chat (1.3%) and Google Gemini 2.0 Flash Experimental (1.3%). These models demonstrated significantly lower hallucination rates, suggesting that they may be more suitable for healthcare applications where accuracy is critical. Another study assessing the role of LLMs in systematic reviews found similarly varied results, with hallucination rates of 39.6% for GPT-3.5, 28.6% for GPT-4 and an alarmingly high 91.4% for Bard. These findings emphasise the need for careful selection and validation of LLMs before integrating them into healthcare workflows, as models with high hallucination rates could pose risks to clinical decision-making and research integrity.
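To make figures like these concrete, the sketch below shows one common way a hallucination rate is estimated: a sample of model responses is manually labelled as supported or hallucinated against source material, and the rate is simply the hallucinated fraction. The model names, prompts and labels here are illustrative placeholders, not data from the studies cited above.

```python
from dataclasses import dataclass


@dataclass
class LabelledResponse:
    model: str
    prompt: str
    hallucinated: bool  # True if a reviewer judged the answer unsupported by the source


def hallucination_rate(responses: list[LabelledResponse], model: str) -> float:
    """Fraction of a model's labelled responses judged to be hallucinations."""
    own = [r for r in responses if r.model == model]
    if not own:
        raise ValueError(f"No labelled responses for model {model!r}")
    return sum(r.hallucinated for r in own) / len(own)


# Placeholder labelled sample (not drawn from the cited evaluations)
sample = [
    LabelledResponse("model-a", "Summarise trial X", hallucinated=False),
    LabelledResponse("model-a", "List contraindications of drug Y", hallucinated=True),
    LabelledResponse("model-b", "Summarise trial X", hallucinated=False),
]

print(f"model-a: {hallucination_rate(sample, 'model-a'):.1%}")  # 50.0%
```

Published leaderboards follow the same principle at scale, typically replacing the manual reviewer with an automated fact-checking step before averaging over thousands of responses.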
Assessing Diagnostic Accuracy
Beyond their hallucination rates, LLMs are also being evaluated for their potential role in medical diagnostics, where accurate and timely assessments are crucial. A Japanese study assessed the diagnostic capabilities of GPT-4o, Claude 3 Opus and Gemini 1.5 Pro in solving real-world radiology cases. These models were tested using over 300 quiz questions that included clinical histories and imaging findings. The models demonstrated varying degrees of accuracy, with diagnostic success rates of 41% for GPT-4o, 54% for Claude 3 Opus and 33.9% for Gemini 1.5 Pro. These results highlight the differences in performance across LLMs, suggesting that some models may be better suited for diagnostic support than others.
Similarly, researchers have tested the effectiveness of LLMs in rare disease diagnosis, an area where human expertise is often limited due to the infrequency of certain conditions. A BERT-based natural language processing tool, PhenoBrain, was evaluated using rare disease data sets from multiple countries, covering over 2,000 cases and 431 diseases. PhenoBrain outperformed 13 other predictive models, including GPT-4, achieving a top-3 recall of 0.513 and a top-10 recall of 0.654. These findings suggest that specialised models may have advantages in certain areas of diagnostics, particularly for conditions that individual clinicians rarely encounter.
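As a point of reference for the top-3 and top-10 recall figures quoted above, the sketch below shows how top-k recall is typically computed for a ranked differential diagnosis: a case counts as recalled if the true diagnosis appears among the model's first k suggestions. The example cases are hypothetical and not taken from the PhenoBrain evaluation.

```python
def top_k_recall(ranked_predictions: list[list[str]],
                 true_diagnoses: list[str],
                 k: int) -> float:
    """Fraction of cases whose true diagnosis appears in the top-k ranked candidates."""
    if len(ranked_predictions) != len(true_diagnoses):
        raise ValueError("Each case needs one ranked candidate list and one true diagnosis")
    hits = sum(
        truth in candidates[:k]
        for candidates, truth in zip(ranked_predictions, true_diagnoses)
    )
    return hits / len(true_diagnoses)


# Hypothetical cases: each inner list is a model's ranked differential for one patient
predictions = [
    ["Fabry disease", "Gaucher disease", "Pompe disease"],
    ["Marfan syndrome", "Loeys-Dietz syndrome", "Ehlers-Danlos syndrome"],
]
truths = ["Gaucher disease", "Ehlers-Danlos syndrome"]

print(top_k_recall(predictions, truths, k=1))  # 0.0
print(top_k_recall(predictions, truths, k=3))  # 1.0
```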
Improving Model Reliability with Specialised Training
One of the fundamental challenges with LLMs is their reliance on broad datasets that include both credible and unreliable sources. Many of the most well-known LLMs are trained on vast amounts of internet data, which contain both high-quality research and misinformation. This lack of specificity in training data can reduce their reliability in healthcare settings. Some companies are addressing this issue by developing models trained exclusively on high-quality biomedical data.
Platforms such as Consensus and OpenEvidence stand out for their reliance on peer-reviewed studies and systematic reviews rather than general internet data. These models aim to provide clinicians with more trustworthy insights by drawing from curated academic sources rather than from unreliable web pages. Additionally, the effectiveness of LLMs can be enhanced through prompt engineering. By asking follow-up questions—such as verifying the sources of a chatbot’s references—users can mitigate the risks of misinformation. For example, if a user asks ChatGPT to provide a list of studies on a specific diagnostic test, they should follow up by asking whether all references come from legitimate, peer-reviewed medical journals. If the chatbot's sources are unreliable, it will often admit to the issue when questioned directly, allowing users to verify citations independently. However, even with these improvements, independent verification remains essential to ensure reliability.
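One way to operationalise this kind of follow-up questioning is sketched below using the OpenAI Python SDK: after requesting a list of studies, the same conversation is continued with a prompt asking the model to flag any reference it cannot confirm is a real, peer-reviewed publication. The model name and prompts are assumptions made for illustration, and the model's self-assessment is only a starting point; every citation still needs independent checking, for example against PubMed.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
MODEL = "gpt-4o"   # assumed model name; substitute whichever model you use

messages = [
    {"role": "user",
     "content": "List three peer-reviewed studies on the diagnostic accuracy of "
                "high-sensitivity troponin for myocardial infarction, with full citations."}
]
first = client.chat.completions.create(model=MODEL, messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up prompt asking the model to audit its own references
messages.append({
    "role": "user",
    "content": "For each reference above, state whether you are certain it is a real, "
               "peer-reviewed publication. Flag any citation you may have fabricated."
})
audit = client.chat.completions.create(model=MODEL, messages=messages)

print(audit.choices[0].message.content)
# Treat flagged (and unflagged) citations as candidates for manual verification.
```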
While LLMs are far from replacing trusted medical databases like PubMed, they are proving to be useful tools when applied cautiously. Their ability to process large volumes of information quickly can assist clinicians and researchers, provided their outputs are validated. The key to their successful integration lies in selecting well-trained models, employing careful prompt engineering and cross-checking responses with authoritative sources. As advancements continue, LLMs may become more reliable, but for now, their role in healthcare must be approached with careful scrutiny. Despite their limitations, these models hold significant potential in supporting clinical decision-making and research when used judiciously, ensuring that accuracy remains the priority in healthcare applications.
Source: Mayo Clinic