Diagnostic errors in complex inpatient cases continue to challenge hospitals, motivating interest in whether large language models can complement clinical judgement. Recent evidence based on consecutive case records from a major academic centre compared several widely used AI systems with resident physicians, then explored what happens when each side learns from the other. The evaluation tracked whether the correct diagnosis appeared first, whether it appeared anywhere in the differential, the length of the differential lists and a global quality score. When physicians reviewed the output of a strong model their performance improved, and when models were seeded with clinicians’ differentials they also did better. This pattern points to practical, supervised collaboration rather than replacement, with potential value where specialist input is scarce.

 

How AI and Clinicians Performed Side by Side

Five publicly accessible models – Claude-Sonnet, Gemini 1.5, GPT-4, GPT-4o and OpenAI-o1 – were tested on 35 consecutive Massachusetts General Hospital case records spanning November 2023 to September 2024. Each model received the case narrative, tables and figures up to the discussant’s initial response and differential. OpenAI-o1 led performance, placing the final diagnosis at the top more often than the others and scoring highest on overall quality. The remaining models showed mixed strengths, for example producing shorter or longer lists and ranking the correct answer differently within them.
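As a rough illustration of how such metrics can be computed, the sketch below scores a set of ranked differentials against final diagnoses. The cases and metric definitions are invented for demonstration and are not the study’s data or scoring protocol.

```python
# Illustrative scoring of ranked differentials against final diagnoses.
# The cases below are assumptions for demonstration, not study data.

cases = [
    # (final diagnosis, model's ranked differential)
    ("sarcoidosis", ["sarcoidosis", "lymphoma", "tuberculosis"]),
    ("giant cell arteritis", ["polymyalgia rheumatica", "giant cell arteritis"]),
    ("amyloidosis", ["multiple myeloma", "chronic kidney disease"]),
]

top1_hits = 0       # correct diagnosis ranked first
anywhere_hits = 0   # correct diagnosis appears anywhere in the list
total_length = 0    # used for mean differential length

for final_dx, differential in cases:
    ranked = [dx.lower() for dx in differential]
    if ranked and ranked[0] == final_dx.lower():
        top1_hits += 1
    if final_dx.lower() in ranked:
        anywhere_hits += 1
    total_length += len(ranked)

n = len(cases)
print(f"Top-1 accuracy:      {top1_hits / n:.2f}")
print(f"Within-differential: {anywhere_hits / n:.2f}")
print(f"Mean list length:    {total_length / n:.1f}")
```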

 

Resident physicians from several specialties provided their own differentials without AI assistance, then revisited their reasoning after seeing the ranked list from the best-performing model. Accuracy rose after exposure to the model output and the overall quality of differentials improved. Physicians tended to expand their lists slightly, reflecting consideration of additional possibilities surfaced by the model, while the typical position of the correct diagnosis within their lists was largely stable. Statistical tests applied to paired results indicated that the improvements in accuracy and differential quality were unlikely to be due to chance. 
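The article does not name the paired tests used. For paired quality scores, one common choice would be a Wilcoxon signed-rank test, sketched below with invented scores; treat the test selection and data purely as assumptions.

```python
# Hypothetical paired comparison of differential quality scores before and
# after physicians reviewed the model's ranked list. Scores are invented,
# and the Wilcoxon signed-rank test is an assumed choice; the article does
# not specify which paired tests the study applied.
from scipy.stats import wilcoxon

before = [3, 2, 4, 3, 2, 3, 4, 2, 3, 3]   # quality score per physician-case pair
after  = [4, 3, 4, 4, 3, 3, 5, 3, 4, 4]   # same pairs after seeing model output

stat, p_value = wilcoxon(before, after)
print(f"Wilcoxon statistic: {stat}, p-value: {p_value:.4f}")
```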

 

What Changes When Each Side Learns from the Other 

The collaboration was bidirectional. Feeding pooled physician differentials back into the models increased the rate at which models placed the correct diagnosis first and the frequency with which it appeared anywhere in the list. Combining inputs from multiple clinicians was particularly helpful. When a single physician’s differential was supplied to one model, performance sometimes improved and sometimes dipped, but the aggregated lists consistently lifted model accuracy. The signal from diverse human reasoning appeared to counterbalance individual blind spots and guide the models toward more reliable prioritisation. 
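One simple way to pool several clinicians’ ranked differentials into a single consensus list is a Borda-style count, sketched below. This summary does not describe the study’s actual aggregation method, so the pooling rule here is an assumption for illustration only.

```python
# Illustrative Borda-style pooling of several clinicians' ranked differentials
# into one consensus list that could seed a model prompt. The aggregation rule
# is an assumption; the study's pooling method is not described here.
from collections import defaultdict

clinician_lists = [
    ["lymphoma", "sarcoidosis", "tuberculosis"],
    ["sarcoidosis", "lymphoma", "histoplasmosis"],
    ["sarcoidosis", "tuberculosis", "lymphoma"],
]

scores = defaultdict(float)
for ranked in clinician_lists:
    n = len(ranked)
    for position, dx in enumerate(ranked):
        scores[dx.lower()] += n - position   # higher rank earns more points

pooled = sorted(scores, key=scores.get, reverse=True)
print("Pooled differential:", pooled)
```

A prompt could then present this consensus list as structured context for the model to critique and re-rank.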

 

For clinicians, exposure to a high-performing model’s ranked list supported reappraisal rather than substitution. Their differentials became more comprehensive and their diagnostic hit rate increased, while core reasoning steps and the rough ordering of likely causes remained intact. The interplay is notable: the model benefited from structured human input and clinicians benefited from a curated, ranked reminder of conditions that might otherwise be overlooked. Together, these effects suggest a supervised, human-in-the-loop approach can make difficult cases more tractable without asking clinicians to surrender control. 

 

Scope, Limits and Practical Use 

The 35-case set covered multiple specialties and a range of final diagnoses, creating a varied testbed that mirrors the breadth of problems encountered on general services. The time window for selecting cases was chosen to reduce the chance that models had seen identical examples during training. Within this frame, two practical messages emerge. First, when clinicians can consult a strong model’s differential as a second set of eyes, both accuracy and the judged quality of reasoning improve. Second, models respond to structured clinical input, with pooled human differentials providing a particularly effective boost.

 

There are important limits. The number of cases and participating physicians was modest, and the clinician cohort consisted of residents rather than consultants. Performance varied across models and across individual physician–model pairings, so local outcomes will depend on the specific tools, workflows and teams involved. Even so, the design demonstrated feasibility across multiple model families and marked out a simple interaction pattern that can be tested within real services: start with clinician-led reasoning, add a model’s ranked list as a cross-check, then allow the model to refine its output using aggregated clinician inputs.
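A minimal sketch of that three-step pattern, using a stand-in model function rather than any real vendor API, might look like the following; every name here is hypothetical and the logic is illustrative only.

```python
# Minimal sketch of the three-step interaction pattern described above.
# `model_differential` is a stand-in, not the study's implementation
# or any specific vendor API.

def model_differential(case_text, seed_list=None):
    # Stand-in for a real model call; a real system would prompt a model
    # with the case and, optionally, a pooled clinician differential.
    if seed_list:
        return list(dict.fromkeys(seed_list + ["additional candidate"]))
    return ["candidate A", "candidate B", "candidate C"]

case_text = "case narrative, tables and figures ..."

# Step 1: clinician-led reasoning, recorded before any model involvement.
clinician_list = ["sarcoidosis", "lymphoma"]

# Step 2: the model's independent ranked list acts as a cross-check
# that the clinician reviews and may use to revise the differential.
model_list = model_differential(case_text)

# Step 3: a pooled clinician differential (e.g. from several colleagues)
# seeds a refined model pass that re-ranks with human guidance.
pooled = ["sarcoidosis", "lymphoma", "tuberculosis"]
refined = model_differential(case_text, seed_list=pooled)

print("Model cross-check:", model_list)
print("Refined with pooled input:", refined)
```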

 

Translational implications are direct. General medical teams can explore supervised deployment where a model’s suggestions are used to broaden consideration, not to dictate decisions. Services with limited specialist coverage, including primary care or resource-constrained settings, may gain a safety net for rare or atypical conditions that fall outside routine experience. Governance remains essential. The results do not imply autonomy for AI tools, but rather support for structured collaboration that is auditable, measurable and compatible with existing diagnostic pathways. 

 

Taken together, the findings indicate that carefully designed collaboration between clinicians and large language models can raise diagnostic accuracy and improve the quality of differential construction in complex cases. A high-performing model offers a useful prompt that helps clinicians revisit possibilities, and pooled clinician input provides models with reliable guidance that sharpens their ranking. Within the acknowledged limits of case numbers and clinician experience, this pathway aligns with supervised decision support and offers a practical starting point for teams seeking safe, incremental gains in diagnostic performance. 

 

Source: The Lancet Digital Health 

Image Credit: iStock


References:

Lam K, Calvo Latorre J, Yiu A et al. (2025) Physician input improves generative artificial intelligence models’ diagnostic performance in solving complex clinical cases. The Lancet Digital Health: Online first. 


