Large language models are increasingly marketed for clinical use, but their ability to support full-spectrum clinical reasoning remains uncertain. A cross-sectional investigation published in JAMA Network Open assessed 21 off-the-shelf models on 29 standardised clinical vignettes from the January 2025 update of the MSD Manual. The models included GPT-5, Claude 4.5 Opus, Gemini 3.0 Flash and Pro, and Grok 4. Performance was assessed across sequential stages of the clinical workflow: differential diagnosis, diagnostic testing, final diagnosis, management and miscellaneous clinical reasoning questions. The benchmark used the Proportional Index of Medical Evaluation for LLMs, or PrIME-LLM, to measure balanced accuracy across these domains and to identify weaknesses that may not appear in traditional accuracy measures.

 

PrIME-LLM Measures Balanced Clinical Performance

The PrIME-LLM framework converts performance across the five reasoning domains into a single score. Each domain forms one point on a radar plot, and the model’s score is calculated from the polygonal area created by those five points. A full-scale reference polygon represents complete accuracy across all domains, while lower values reflect uneven or weaker performance. The score ranges from 0 to 1, with higher scores indicating stronger and more consistent clinical reasoning.
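The radar-area idea can be illustrated with a minimal sketch. It is not the study's implementation: the article does not state the exact formula, the axis order or the normalisation, so the code assumes five equally spaced axes, per-domain accuracies between 0 and 1, and a score equal to the polygon's area divided by the area of the full-scale polygon. The function name `radar_area_score` and the example numbers are hypothetical.

```python
import math


def radar_area_score(scores):
    """Normalised radar-polygon area for per-domain accuracies in [0, 1].

    Assumption: axes are equally spaced and the score is the ratio of the
    model's polygon area to the area of a polygon with every axis at 1.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three axes to form a polygon")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")

    # Each pair of neighbouring axes spans a triangle with area
    # 0.5 * r_i * r_j * sin(angle between the axes).
    wedge = 0.5 * math.sin(2 * math.pi / n)
    area = sum(wedge * scores[i] * scores[(i + 1) % n] for i in range(n))
    full_area = wedge * n  # every axis at 1.0
    return area / full_area


# Hypothetical profiles in the order: differential diagnosis, diagnostic
# testing, final diagnosis, management, miscellaneous. Both average 0.80.
balanced = [0.80, 0.80, 0.80, 0.80, 0.80]
uneven = [0.20, 0.90, 1.00, 0.95, 0.95]

print(round(radar_area_score(balanced), 4))
print(round(radar_area_score(uneven), 4))
```

Under these assumptions the two hypothetical profiles share the same mean accuracy, yet the profile with a weak differential diagnosis axis gets the smaller area. Because the area depends on products of neighbouring axes, the result is also sensitive to the order in which the domains are placed around the plot.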

 

This approach differs from simple accuracy because it rewards balanced competence across the clinical workflow. A model can score well on final diagnosis while still performing poorly on earlier diagnostic steps. PrIME-LLM reduces the chance that high performance in one domain obscures weakness in another.

 


 

The benchmark used 29 stepwise clinical vignettes that present history, review of systems, examination findings and laboratory results. Questions followed the sequence of differential diagnosis, diagnostic testing, final diagnosis, management and additional clinical reasoning. The models processed each vignette sequentially, with clinical context preserved throughout. Medical student evaluators scored outputs against MSD Manual answer keys. Full credit required inclusion of all correct answers and exclusion of incorrect options. Each vignette was evaluated three times to capture variability across model runs.
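The full-credit rule and the three repeated runs can be sketched as follows. This is an illustration under stated assumptions, not the evaluators' actual procedure: answers are represented as sets of option labels, only the all-or-nothing full-credit check is modelled (the article does not describe any partial-credit scheme), and averaging over runs is an assumed way to summarise variability. The function names and the sample data are hypothetical.

```python
def full_credit(selected, answer_key):
    """True only if every correct option is chosen and no incorrect one is."""
    return set(selected) == set(answer_key)


def full_credit_rate(runs, answer_key):
    """Share of repeated runs that earned full credit on one question."""
    if not runs:
        raise ValueError("at least one run is required")
    return sum(full_credit(run, answer_key) for run in runs) / len(runs)


# Hypothetical question with two correct options, answered in three runs.
answer_key = {"A", "C"}
runs = [
    {"A", "C"},       # all correct options, nothing extra
    {"A"},            # misses a correct option
    {"A", "C", "D"},  # adds an incorrect option
]

print(full_credit_rate(runs, answer_key))
```

A strict rule of this kind penalises both omissions and extra options, which may help explain why list-style questions are harder to score well on than single-answer questions.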

 

Reasoning Models Perform Better but Remain Uneven

PrIME-LLM scores differed across the 21 models, ranging from 0.64 for Gemini 1.5 Flash to 0.78 for Grok 4. A top-performing cluster included Grok 4, GPT-5, GPT-4.5, Claude 4.5 Opus, Gemini 3.0 Flash and Gemini 3.0 Pro, and many of the differences among these leading models were not statistically significant. Newer releases generally outperformed earlier versions within the same developer families.

Overall accuracy clustered more narrowly than PrIME-LLM scores. Mean accuracy values sat between 0.81 and 0.90, while PrIME-LLM scores created wider separation between stronger and weaker models. This difference indicates that raw accuracy can conceal important variation in multidimensional reasoning performance.

 

Reasoning-optimised models outperformed models not advertised as having reasoning capabilities. The reasoning group achieved a mean PrIME-LLM score of 0.76, compared with 0.67 for the nonreasoning group. Regression analysis also associated reasoning capability with higher accuracy and higher PrIME-LLM radar area scores. However, the advantage did not remove persistent weaknesses in differential diagnosis.

 

Performance varied substantially by question type. Final diagnosis questions generally produced higher accuracy than diagnostic testing, differential diagnosis, management and miscellaneous reasoning questions. Differential diagnosis produced the lowest accuracy across models. This pattern suggests that models are stronger when selecting a final diagnosis from provided information than when generating and maintaining a broader diagnostic set.

 

Clinical Deployment Concerns Remain

Failure rates reinforced the gap between final answers and earlier diagnostic reasoning. Failure rates exceeded 0.80 for differential diagnosis in all models, while final diagnosis failure rates were below 0.40. Diagnostic testing, management and miscellaneous clinical reasoning questions produced intermediate failure rates with variation between models. Some vignettes were consistently difficult across model families, indicating broad limitations rather than isolated weaknesses in one developer group.

 

Multimodal performance was mixed. Eighteen image-capable models were assessed on vignettes that included images such as chest radiographs, computed tomography scans and electrocardiograms. Accuracy on non-image questions was generally more consistent, while performance on image-based questions varied by model. GPT-o3-Mini, Claude 3 Opus, GPT-4.5, Gemini 2.5 Pro, Gemini 3.0 Pro, Gemini 3.0 Flash and Grok 4 showed significantly higher accuracy on image-based items than on non-image items, while the other models showed no significant difference between modalities.

 

Several constraints shape interpretation of the results. The benchmark assessed off-the-shelf models without external augmentation. Optional web search and reasoning settings were disabled where available. Models were accessed through a mixture of application programming interfaces and web-based interfaces. Public availability of the vignettes means prior exposure during pretraining cannot be fully excluded. The benchmark did not test retrieval-augmented generation, guideline access, calculators or agentic tool use, which may alter performance in clinical settings.

 

Current frontier large language models show strong performance on final diagnosis tasks but remain limited in differential diagnosis and uncertainty management. PrIME-LLM separates models more clearly than raw accuracy and exposes reasoning gaps that standard benchmarks can miss. Reasoning-optimised models perform better than nonreasoning models, but improvements remain incremental across the full clinical workflow. Off-the-shelf models therefore remain unsuitable for unsupervised patient-facing clinical decision-making, with the most responsible role limited to targeted, clinician-supervised use in low-uncertainty tasks.

 

Source: JAMA Network Open



References:

Rao AS, Esmail KP, Lee RS et al. (2026) Large Language Model Performance and Clinical Reasoning Tasks. JAMA Netw Open, 9(4):e264003.



