Large language models (LLMs) are increasingly used in healthcare for education, research, and clinical care. They have shown promise by passing all stages of the US Medical Licensing Exam and generating accurate, empathetic responses to patient queries, suggesting the potential to assist physicians in complex clinical decision-making, including differential diagnosis. 

 

Diagnosing critically ill patients is particularly challenging: presentations in intensive care units (ICUs) are complex and demand rapid, accurate decisions. While LLMs such as GPT-4o have demonstrated high accuracy on exam-level critical care questions, they may produce inconsistent or incorrect information in real-world scenarios, highlighting a gap in understanding their effectiveness for complex ICU diagnosis. Reasoning models, which apply structured, stepwise thinking, represent an advancement over traditional LLMs.

 

DeepSeek-R1, a new open-source reasoning model released in 2025, has rapidly gained popularity and shows promise for complex critical care cases. A new study evaluated DeepSeek-R1’s diagnostic performance and compared critical care physicians’ accuracy and efficiency with and without its assistance to assess its potential benefits in ICU diagnosis. This study is among the first prospective trials to examine AI assistance in such high-stakes critical care diagnostic dilemmas, providing higher-quality evidence for the benefits of reasoning AI models in real-world clinical scenarios.


The study used challenging critical illness cases from literature and involved critical care residents from tertiary teaching hospitals. Participants were randomly assigned to either a non-AI-assisted group or an AI-assisted group using the reasoning model DeepSeek-R1. The study evaluated DeepSeek-R1’s response quality with Likert scales and compared diagnostic accuracy and efficiency between the two groups.

 

The study included 48 critical illness cases and 32 critical care residents, divided evenly into AI-assisted and non-AI-assisted groups, with each resident handling about three cases. DeepSeek-R1 received high median Likert scores for response completeness (4.0), clarity (5.0), and usefulness (5.0). DeepSeek-R1’s top-diagnosis accuracy was 60%, with a median differential diagnosis quality score of 5.0. Physicians without AI assistance had a top-diagnosis accuracy of 27%, while those with AI assistance improved to 58%. Differential diagnosis quality scores were higher with AI (median 5.0) than without (median 3.0). Additionally, AI assistance significantly reduced diagnostic time, from a median of 1920 seconds (32 minutes) without AI to 972 seconds (about 16 minutes) with AI.

 

Overall, DeepSeek-R1 outperformed residents in diagnostic accuracy and, when used as an aid, improved both their accuracy and efficiency. It produced complete, clear, and clinically useful differential diagnoses with reasonable accuracy. Its assistance significantly improved residents’ diagnostic accuracy and reduced diagnostic time, even though traditional online resources were available to all participants.

 

DeepSeek-R1’s superior performance may stem from its reinforcement learning-based pretraining and self-reflective reasoning abilities, enabling autonomous verification and optimisation of complex clinical logic. As an open-source model, it is also well-suited for resource-limited healthcare settings. However, like other LLMs, DeepSeek-R1 can produce hallucinations (confidently stated inaccuracies), underscoring the importance of its use as an assistive tool rather than a standalone decision-maker.

 

Overall, these findings support the potential of advanced reasoning AI models like DeepSeek-R1 to assist clinicians in improving diagnostic accuracy and efficiency in complex ICU cases while emphasising the need for cautious, collaborative integration into clinical practice. Further research with larger samples is needed to assess their potential for widespread clinical adoption in real-world critical care settings.

 

Source: Critical Care

Image Credit: iStock 

 

