Large language models are entering diagnostic workflows as tools that can generate both suggested diagnoses and written explanations. Their value depends not only on whether a diagnostic suggestion is correct, but also on how the explanation supports clinical judgement. A randomised experiment published in npj Digital Medicine examined how different formats of large language model output affected radiologists’ diagnostic accuracy when reviewing patient cases with radiological images. The results show that explanation format matters. A step-by-step reasoning format produced the strongest improvement, while a differential diagnosis format created a higher risk of following incorrect advice. The findings place explanation design, prompt strategy and critical review of model output at the centre of clinical deployment decisions.
How the Diagnostic Task Was Designed
The experiment involved 101 radiologists from the United States and yielded 2,020 assessments in total. Each radiologist reviewed 20 real-world patient cases, each containing a short clinical description and at least one CT or MRI image. Diagnoses were entered as free text rather than chosen from multiple-choice options, to better reflect clinical practice.
The cases came from the New England Journal of Medicine Image Challenge and covered a broad range of diagnostic difficulty. Most cases could be answered using knowledge from a standard diagnostic radiology textbook, while a smaller share required more specialised expertise. The selected cases included both general radiology content and subspecialty-specific challenges.
Random assignment placed radiologists into a control group without large language model support or into one of three supported groups. One group received a standard output, consisting of a single diagnosis with no explanation or only a brief rationale. A second group received a differential diagnosis, listing five possible diagnoses in descending likelihood, with short justifications. A third group received a chain-of-thought explanation, giving a detailed step-by-step rationale leading to the final recommendation.
GPT-4 generated the advice from the patient text and imaging data. Outputs remained unchanged, including diagnostic errors and hallucinations, to approximate routine use of large language model assistance in clinical settings.
Step-by-Step Reasoning Improved Accuracy
Radiologists supported by chain-of-thought explanations achieved the best overall diagnostic performance. This format improved diagnostic accuracy by 12.2 percentage points compared with no large language model support. It also outperformed the standard output by 7.2 percentage points and the differential diagnosis format by 9.7 percentage points.
The standard output and differential diagnosis formats produced smaller gains over the control group. Standard output improved diagnostic accuracy by 5.0 percentage points, while the differential diagnosis format improved it by 2.5 percentage points. Performance in both groups remained comparable with that of the control group.
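The reported figures are mutually consistent, which a few lines of arithmetic can confirm. The sketch below takes the three gains over the control group as given in the article and derives the gaps between formats. The variable and function names are illustrative and do not come from the study. The values are differences in percentage points, not relative percentage changes.

```python
# Gains over the control group, in percentage points (figures from the article).
gain_vs_control = {
    "chain_of_thought": 12.2,
    "standard": 5.0,
    "differential": 2.5,
}

def gap(format_a: str, format_b: str) -> float:
    """Difference in gain between two formats, in percentage points."""
    # Round to one decimal place to avoid floating-point noise.
    return round(gain_vs_control[format_a] - gain_vs_control[format_b], 1)

print(gap("chain_of_thought", "standard"))      # 7.2
print(gap("chain_of_thought", "differential"))  # 9.7
```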
The advantage of chain-of-thought explanations remained robust after adjustment for physician-specific characteristics, including years of medical experience, radiology expertise, visual inspection workload, IT skills and experience with medical AI. Additional checks accounting for decision time, output length and answer length continued to show a significant advantage for step-by-step reasoning over no model support.
GPT-4 alone achieved moderate diagnostic accuracy across the selected cases. Accuracy varied by output format, with standard output reaching 75%, differential diagnosis reaching 65% for the top answer and 80% across the top five options, and chain-of-thought output reaching 80%. The case set therefore remained challenging for the model, reinforcing the need for careful physician review.
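The distinction between accuracy for the top answer and accuracy across the top five options can be illustrated with a small calculation. The sketch below is a simplified illustration and not the study's scoring procedure: the case labels are hypothetical placeholders, and it matches diagnoses by exact string, whereas free-text diagnoses would in practice need clinical judgement to score.

```python
def top_k_accuracy(ranked_predictions, truths, k):
    """Share of cases whose true diagnosis appears among the first k suggestions."""
    hits = sum(
        truth in preds[:k]
        for preds, truth in zip(ranked_predictions, truths)
    )
    return hits / len(truths)

# Hypothetical toy data: four cases, each with a ranked list of suggestions.
ranked = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["C", "A", "B"],
    ["A", "C", "D"],
]
truth = ["A", "A", "B", "E"]

print(top_k_accuracy(ranked, truth, k=1))  # 0.25
print(top_k_accuracy(ranked, truth, k=3))  # 0.75
```

A ranked list can only raise the hit rate as k grows, which is why a differential diagnosis can score lower than a single-answer format on its top suggestion yet higher across the full list.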
Explanation Format Shaped Clinical Reliance
The experiment also examined whether radiologists followed or overrode large language model advice. This behaviour differed substantially by explanation format. When the model-generated diagnosis was incorrect, adherence was highest in the differential diagnosis group. Radiologists in that group followed incorrect advice more often than those receiving standard output or chain-of-thought explanations.
When the model-generated diagnosis was correct, adherence was highest in the chain-of-thought group. This pattern points to more selective reliance: radiologists were more likely to follow correct advice and more likely to reject incorrect advice when the model provided step-by-step reasoning.
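Selective reliance can be made concrete by computing adherence separately for correct and incorrect advice. The following is a minimal sketch under stated assumptions: the record layout and the toy values are hypothetical and are not data from the study.

```python
from collections import defaultdict

def adherence_by_advice_correctness(records):
    """Share of assessments that followed the advice,
    reported separately for correct and incorrect advice."""
    followed = defaultdict(int)
    total = defaultdict(int)
    for advice_correct, radiologist_followed in records:
        key = "correct_advice" if advice_correct else "incorrect_advice"
        total[key] += 1
        followed[key] += int(radiologist_followed)
    return {key: followed[key] / total[key] for key in total}

# Hypothetical toy records: (advice_correct, radiologist_followed)
records = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]

rates = adherence_by_advice_correctness(records)
print(rates)  # {'correct_advice': 0.75, 'incorrect_advice': 0.25}
```

High adherence to correct advice combined with low adherence to incorrect advice is the pattern described for the chain-of-thought group. High adherence in both categories would indicate over-reliance.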
Differential diagnosis explanations may have encouraged over-adherence because they presented a fixed list of named possibilities. The format can narrow consideration to the listed options and reduce attention to alternatives outside the list, especially when none of the suggestions fully fits the clinical picture. In contrast, chain-of-thought reasoning focuses on one diagnostic pathway and gives radiologists a route to inspect the logic behind the suggestion.
The advantage of chain-of-thought explanations appeared across different physician and case characteristics. Radiologists with different levels of IT skill and different lengths of medical tenure showed a similar trend. General radiologists and subspecialised radiologists also showed broadly consistent results, including specialists working on cases matched to their area of expertise.
Explanation design has a measurable effect on how radiologists use large language model support. Step-by-step reasoning produced the strongest diagnostic performance and supported more appropriate reliance on model advice. Differential diagnosis output, despite its alignment with a familiar clinical format, increased the risk of following incorrect suggestions. The results reinforce the need to treat prompt design, explanation format and physician evaluation as core elements of medical AI implementation, rather than secondary interface details.
Source: npj Digital Medicine
References:
Spitzer P, Hendriks D, Rudolph J et al. (2026) The effect of medical explanations from large language models on diagnostic accuracy in radiology. npj Digit Med, 9:333.