Generative AI now occupies a growing place in clinical reasoning, with large language models showing strong performance on medical cognitive tasks while also reflecting human-like biases. Reasoning models, publicly introduced in September 2024 and now used in commercial products such as ChatGPT and Gemini, add chain-of-thought processing during inference. This allows them to break complex clinical scenarios into steps, check logic, correct errors and improve performance across cognitive tasks. An editorial in BMJ Quality & Safety calls for a shift away from evaluating these systems as isolated, autonomous decision-makers. It frames large language models as cognitive technologies whose value depends on how clinicians and artificial intelligence work together to support decision-making, rather than on whether model outputs appear flawless in artificial benchmark settings.
Rethinking Bias in Clinical Reasoning
Large language models inherit both the strengths and weaknesses of the human material used to train and reinforce them. Their outputs can resemble clinician reasoning because they draw on vast bodies of human text and reasoning examples. Yet similar patterns in text-based benchmarks do not mean that human and model biases share the same causes. Human cognitive bias can involve emotion and context, while model bias reflects statistical association.
Clinical reasoning has long treated heuristics as possible contributors to diagnostic error. That framing can obscure their role as efficient, experience-based shortcuts that help experts make decisions. Heuristics have been recognised as a core feature of clinical reasoning since the 1970s and can represent a strength rather than a flaw. Evidence increasingly suggests that apparent cognitive biases may often reflect knowledge deficits instead. Teaching learners about cognitive biases has also shown limited effect in reducing such errors.
Cognitive bias labels are usually applied in hindsight, only after an incorrect diagnosis. Clinical vignettes add another limitation because they often aim to lure a person or model into a predictable error. A bias found under such conditions shows possibility, not real-world frequency or impact in clinical environments where information is richer and collaboration occurs.
Limits of Current AI Testing
Current evaluation approaches often rely on older mental models. Some treat large language models like earlier decision support tools, including expert systems or Bayesian prediction methods. Measures such as sensitivity, specificity, precision and recall can be useful for defined tasks, but they assume a fixed deterministic classifier. Large language models operate differently. They are stochastic, generalist systems whose performance can shift substantially according to context.
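The mismatch between classifier metrics and stochastic systems can be sketched in a few lines of Python. The sketch below is illustrative only and is not drawn from the editorial: `stochastic_model` is a hypothetical stand-in for a non-deterministic system, and its `p_correct` parameter is an assumed value chosen for the example. It shows the standard confusion-matrix definitions, then how a single sensitivity figure becomes a spread of values once the same cases are scored over repeated runs.

```python
import random


def confusion_metrics(y_true, y_pred):
    """Standard binary metrics from paired true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)

    def ratio(num, den):
        # Undefined ratios are reported as NaN rather than raising.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # identical to recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


def stochastic_model(has_disease, rng, p_correct=0.8):
    """Hypothetical model: right with probability p_correct (assumed)."""
    return has_disease if rng.random() < p_correct else not has_disease


def sensitivity_across_runs(y_true, n_runs=5, seed=0):
    """Score the same cases repeatedly; each run may give a new value."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_runs):
        y_pred = [stochastic_model(t, rng) for t in y_true]
        values.append(confusion_metrics(y_true, y_pred)["sensitivity"])
    return values


if __name__ == "__main__":
    cases = [True] * 10 + [False] * 10  # made-up case mix
    print(sensitivity_across_runs(cases))
```

With a deterministic classifier every run would return the same sensitivity. Here the list of values is the result, so a single reported number hides run-to-run variation as well as any dependence on context.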
Other evaluations borrow from medical education and treat artificial intelligence as though it were a human trainee taking autonomous responsibility for patients. That assumption does not match the present role of large language models. These systems are algorithms and cannot function autonomously, even as they increasingly perform tasks traditionally associated with physicians, including history taking. There is no compelling evidence that they can provide truly independent medical management.
The expectation of infallibility also creates a poor basis for evaluation. Human clinicians seek second opinions while accepting that colleagues may be wrong. Fallible second opinions and group intelligence can still improve performance. The same logic applies to cognitive technologies. Measuring whether an advanced language model shows a particular fallacy reveals little about whether it helps a clinician care for patients. The relevant question is how a fallible tool changes human performance in practice.
From Model Performance to Collaboration
The human-AI dyad offers a different unit of assessment. The focus moves from model output alone to the combined performance of clinician and system. Earlier experience with computer-aided diagnosis already shows that human interaction can improve or worsen algorithmic performance. Similar concerns appear in aviation, industrial safety and automated driving, where technology and human oversight form a shared operating environment.
Within this framing, some model biases could have useful effects in specific circumstances. Automation bias remains a risk when users accept outputs from high-performing systems too readily, so a visible discrepancy or imperfect model response may prompt critical thinking rather than passive acceptance. Biases may also differ in complementary ways. An artificial intelligence differential diagnosis might incorporate population base rates, while a clinician may adjust probabilities according to the setting and local context. A mismatch between the two could encourage closer appraisal and diagnostic reflection.
The most effective forms of clinical human-AI collaboration remain uncertain. Future systems may interface directly with patients, but regulated models would still connect with clinicians and other parts of the health system. The central task is therefore not to optimise artificial intelligence output in isolation. Evaluation needs to examine whether interaction between clinician and model improves diagnostic performance compared with the clinician alone, including when the artificial intelligence remains imperfect.
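As a rough illustration of what dyad-level evaluation could look like, the Python sketch below compares a clinician's diagnoses made alone with those made alongside a model on the same cases. It is a minimal sketch under stated assumptions, not a method described in the editorial: the function name `dyad_gain`, the paired design and the example diagnoses are all hypothetical.

```python
def dyad_gain(truth, clinician_alone, clinician_with_ai):
    """Paired comparison of diagnostic accuracy on the same cases.

    All three arguments are equal-length lists of diagnosis labels.
    Returns both accuracies, their difference, and the number of cases
    where working with the model helped or harmed.
    """
    n = len(truth)
    if n == 0 or len(clinician_alone) != n or len(clinician_with_ai) != n:
        raise ValueError("inputs must be non-empty and of equal length")

    alone_ok = [a == t for a, t in zip(clinician_alone, truth)]
    dyad_ok = [d == t for d, t in zip(clinician_with_ai, truth)]

    # Discordant cases carry the information about the interaction.
    improved = sum(1 for a, d in zip(alone_ok, dyad_ok) if d and not a)
    worsened = sum(1 for a, d in zip(alone_ok, dyad_ok) if a and not d)

    alone_acc = sum(alone_ok) / n
    dyad_acc = sum(dyad_ok) / n
    return {
        "alone_accuracy": alone_acc,
        "dyad_accuracy": dyad_acc,
        "gain": dyad_acc - alone_acc,
        "improved": improved,
        "worsened": worsened,
    }


if __name__ == "__main__":
    # Made-up labels purely for illustration.
    truth = ["PE", "MI", "pneumonia", "GERD"]
    alone = ["PE", "GERD", "pneumonia", "MI"]
    with_ai = ["PE", "MI", "pneumonia", "MI"]
    print(dyad_gain(truth, alone, with_ai))
```

Nothing in this comparison requires the model's own output to be correct. Only the clinician's final answer is scored, which matches the question of whether an imperfect tool still improves human performance. Reporting the worsened cases separately keeps any harm from automation bias visible instead of averaging it away.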
Large language models represent a new class of clinical decision support because they behave more like cognitive partners than fixed calculators. Their biases do not automatically make them defective, and removing bias from model outputs alone may not improve care. A more useful evaluation paradigm asks where these systems fit in clinical workflows, how interaction can stimulate critical thinking and whether the human-AI dyad performs better than human reasoning alone. Clinical value depends on learning how to reason with the technology, not merely testing it against old frameworks built for earlier tools.
Source: BMJ Quality & Safety
References:
Rodman A & Zwaan L (2026) We need a new paradigm to think about generative AI. BMJ Quality & Safety: Online First.