Clinical decisions are often made with incomplete evidence in notes and records, which increases the risk of error when key findings are missing or inconclusive. Diagnosis support that simply produces labels without transparent reasoning is difficult to trust in these conditions. An uncertainty-aware approach that recognises when evidence is insufficient and explains both the conclusion and its limits offers a route to safer use. ConfiDx, a large language model (LLM) fine-tuned with diagnostic criteria, was developed to meet this need. It frames diagnosis as a joint task of identifying the most likely disease, detecting the presence or absence of diagnostic uncertainty and providing concise explanations for each. Evaluations on real-world corpora show gains in diagnostic performance, stronger uncertainty recognition and clearer rationales, with additional benefits when used alongside clinicians. 

 

Task Design and Data 

The work formalises uncertainty-aware diagnosis into four coordinated subtasks that mirror clinical reasoning: disease diagnosis, diagnostic explanation, uncertainty recognition and uncertainty explanation. By separating these steps, the approach aligns model outputs to guideline-based criteria and ensures that every decision is accompanied by targeted justification. ConfiDx is instantiated on open-source LLMs with 70 billion parameters and adapted through instruction fine-tuning to follow diagnostic rules rather than rely only on pattern matching in text. 
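The four coordinated subtasks amount to one structured output per case. A minimal sketch of such an output record, assuming hypothetical field names (the summary does not give the paper's actual schema):

```python
# Hypothetical structured output for the four coordinated subtasks.
# Field names are illustrative assumptions, not the paper's actual schema.
from dataclasses import dataclass

@dataclass
class DiagnosisOutput:
    disease: str                  # subtask 1: most likely disease
    diagnostic_explanation: str   # subtask 2: evidence supporting the diagnosis
    is_uncertain: bool            # subtask 3: is the available evidence insufficient?
    uncertainty_explanation: str  # subtask 4: why confidence is, or is not, warranted

example = DiagnosisOutput(
    disease="sepsis",
    diagnostic_explanation="Suspected infection with documented organ dysfunction.",
    is_uncertain=True,
    uncertainty_explanation="Lactate and blood-culture results are missing, so one criterion is unverified.",
)
```

Keeping the four fields separate is what lets the model pair every label with a targeted justification rather than a single opaque answer.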

 

Training and evaluation rely on de-identified clinical notes drawn from MIMIC-IV and a cross-institutional set from the University of Minnesota Clinical Data Repository (UMN-CDR). To probe robustness, a hold-out collection from MIMIC-IV (MIMIC-U) excludes certain disease types during training and reintroduces them only at test time. Public case reports from PubMed Central and The New England Journal of Medicine enable comparison with large commercial systems that cannot process privacy-sensitive clinical material. An annotation framework maps guideline criteria to evidence spans in notes, with medical experts verifying outputs to ground predictions and explanations in standard definitions. 
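The criterion-to-evidence mapping described above can be pictured as one annotation record per guideline criterion. A sketch under assumed names (the actual annotation format is not specified in this summary):

```python
# Hypothetical annotation record linking a guideline criterion to the
# character span of supporting evidence in a note. Names are illustrative.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class CriterionAnnotation:
    criterion: str                             # guideline criterion being checked
    evidence_span: Optional[Tuple[int, int]]   # (start, end) offsets in the note; None if absent
    satisfied: Optional[bool]                  # True/False, or None when evidence is inconclusive

note = "Temp 38.9C, WBC 14.2. Blood cultures pending."
present = CriterionAnnotation(criterion="fever > 38C", evidence_span=(0, 10), satisfied=True)
missing = CriterionAnnotation(criterion="positive blood culture", evidence_span=None, satisfied=None)

# A case can be flagged uncertain when any criterion lacks verifiable evidence.
uncertain = any(a.satisfied is None for a in (present, missing))
```

Grounding each prediction in explicit spans is what allows expert verifiers to check model explanations against standard definitions.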

 


 

Performance and Explanations 

Across the main test sets, fine-tuned ConfiDx models improve diagnostic accuracy by more than 68% relative to their off-the-shelf counterparts. Recognition of diagnostic uncertainty also improves markedly, indicating that the system is more likely to flag insufficient evidence than to overstate confidence. Beyond labels, explanation quality strengthens on both fronts: rationales supporting the chosen diagnosis align more closely with guideline-based evidence, and explanations of uncertainty move from minimal levels toward consistent, clinically meaningful statements. 

 

These trends hold when disease types unseen during training are introduced and when evaluation moves across institutions. On the cross-institutional UMN-CDR data, accuracy increases are sustained, and uncertainty recognition remains substantially higher than baselines. When compared on public case reports against large-scale commercial LLMs, ConfiDx delivers uncertainty recognition in the range of 0.80 to 0.90, while commercial systems fall within a lower band of 0.45 to 0.65. The pattern is reinforced by targeted examples in which commercial outputs overlook missing criteria, whereas ConfiDx identifies the gap and explains why a confident diagnosis is not supported. 
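The uncertainty-recognition figures quoted above can be read as scores on a binary flagging task (did the model mark the case as having insufficient evidence?). A minimal scoring sketch, assuming plain F1 on the "uncertain" class; the paper's exact metric is not given in this summary:

```python
# Sketch: scoring binary uncertainty flags with F1 on the "uncertain" class.
# The choice of F1 is an assumption; the source does not name the metric.
def uncertainty_f1(gold, pred):
    tp = sum(g and p for g, p in zip(gold, pred))          # correctly flagged uncertain
    fp = sum((not g) and p for g, p in zip(gold, pred))    # flagged but actually certain
    fn = sum(g and (not p) for g, p in zip(gold, pred))    # uncertain but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [True, True, False, True, False]
pred = [True, False, False, True, True]
score = uncertainty_f1(gold, pred)  # tp=2, fp=1, fn=1 -> precision=2/3, recall=2/3
```

On this toy data the score is 2/3, illustrating how the 0.80–0.90 versus 0.45–0.65 bands would be compared on a common scale.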

 

Generalisability and Human–AI Collaboration 

Analyses of training signals show that both data volume and diversity contribute to outcomes. Performance drops when training data are reduced and improves steadily as coverage expands, suggesting the model benefits from exposure to varied note content and diagnostic contexts. Ablation studies indicate that each subtask objective makes a measurable contribution, which aligns with the idea that diagnosis, explanation and uncertainty detection reinforce one another when learned jointly. 

 

A collaboration study with clinicians highlights complementary strengths rather than substitution. Human experts working alone achieve strong diagnostic performance, yet their recognition and articulation of diagnostic uncertainty improve when assisted by ConfiDx. In these assisted settings, uncertainty recognition rises by 10.7% and explanations of uncertainty improve by 26%. Agreement analyses suggest the model uses expert inputs as a reference point rather than simply mirroring them, indicating potential to support decision-making where evidence is incomplete or ambiguous while keeping experts in control. 

 

Anchoring an LLM to diagnostic criteria and explicitly modelling diagnostic uncertainty leads to clearer, more reliable outputs in settings where evidence often falls short. ConfiDx raises diagnostic accuracy, strengthens detection of insufficient evidence and delivers more faithful explanations across internal tests, unseen diseases and a separate institution. It also compares favourably with large commercial systems on public case material and enhances clinicians’ ability to recognise and explain uncertainty. The results point to the value of systems that make reasoning visible and signal the limits of available evidence, helping teams calibrate trust and act with greater clarity. 

 

Source: npj Digital Medicine 

Image Credit: iStock


References:

Zhou S, Wang J, Xu Z et al. (2025) Uncertainty-aware large language models for explainable disease diagnosis. npj Digit Med 8:690.


