Large language model (LLM) chatbots are becoming commonplace in clinical contexts, including for generating differential diagnoses. Outputs can vary depending on how users frame and structure inputs, yet routine use rarely mirrors the carefully engineered prompts used in controlled evaluations. Prior controlled work reported that an LLM chatbot, when given an entire clinical vignette in a structured prompt, could score higher on clinical reasoning cases than physicians completing the same cases with access to the same chatbot. One proposed explanation was that physicians decide how much of a case to enter, potentially limiting what the system can consider. A sequential mixed-methods research programme examined how physicians interacted with a GPT-4 chatbot during clinical reasoning tasks and whether the amount of case content included in inputs was related to performance.


Four Common Input Approaches
Interviews with 22 US physicians identified two axes of clinician–chatbot interaction: how much vignette content was entered and how the chatbot was instructed to use it. When researchers analysed recorded chat logs from two randomised controlled trials, content amount could be identified consistently, while prompt style was harder to code reliably because inputs could shift within a case and the intent behind directives was not always clear. The resulting typology therefore comprised four approaches defined by the amount of content entered, ordered from most to least.


The first approach involved copying and pasting the entire vignette. Many physicians described this as the fastest and simplest method because it avoided retyping or reformulating information. It was also linked to the expectation that fuller inputs would yield more robust outputs. For diagnostic cases, complete copy-pasting often produced detailed differentials that could be skimmed for idea generation or confirmation, although some clinicians found responses overly verbose.


The second approach involved selectively copying and pasting only parts of the vignette, such as specific symptoms, examination findings or laboratory results. Physicians used this to focus the chatbot on elements they considered most relevant or to avoid overwhelming the system with too much information. Outputs were generally viewed as helpful. However, clinicians who later compared approaches reported that selective copying could take longer than pasting the full case and could yield less detailed responses.


The third approach was to summarise the vignette in the physician’s own words, distilling what was judged most pertinent. This offered flexibility to combine details from across the vignette while excluding information considered irrelevant, aiming to prevent the chatbot from pursuing unhelpful tangents. For some, summarising supported cognitive engagement by reducing the sense of delegating reasoning to the chatbot. A drawback was the risk of omitting details that might matter: given a full vignette, the chatbot could notice aspects that a clinician might gloss over in a summary.


The fourth approach treated the chatbot like an internet search tool, using brief, targeted queries. Some clinicians used this style because it aligned with established habits of looking up information online. Others preferred it after observing that larger inputs tended to generate larger outputs and they wanted concise responses tailored to a specific question. Searching was often used when the physician already had a clear direction and needed only a small amount of confirmatory or supplementary information. A minority reported difficulty designing queries that generated relevant outputs.
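To make the typology concrete, the sketch below renders the four approaches as inputs a physician might type for the same case. Only the four categories come from the study; the mini-vignette and all query wording are invented for illustration.

```python
# Hypothetical one-line vignette; the four categories are the study's,
# but every word of the inputs below is invented for illustration.
vignette = (
    "58-year-old man with 2 hours of crushing chest pain radiating to the "
    "left arm, diaphoresis, BP 150/90, troponin pending."
)

inputs_by_approach = {
    # 1. Copy-paste the entire vignette verbatim
    "copy_paste_all": vignette,
    # 2. Selectively paste only the findings judged most relevant
    "copy_paste_selected": "Crushing chest pain, left arm radiation, diaphoresis.",
    # 3. Summarise the case in the physician's own words
    "summarise": ("Middle-aged man with acute chest pain concerning for ACS. "
                  "What else belongs on the differential?"),
    # 4. Treat the chatbot like a search engine with a brief, targeted query
    "search": "causes of chest pain radiating to the left arm",
}

for approach, text in inputs_by_approach.items():
    print(f"{approach}: {text}")
```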


What the Trials Showed About Performance
After developing the typology, researchers analysed chat logs from the intervention arms of two randomised controlled trials in which physicians completed clinical vignettes with access to GPT-4 alongside usual resources. In the diagnostic trial, 22 of the 25 participating physicians completed 95 cases with both an accessible chat log and a score; chat logs for the remaining three physicians could not be accessed. In the management trial, 42 of 46 physicians completed 158 such cases; chat logs for four physicians could not be accessed. Cases that were started but not completed, including those with a score of 0, were excluded from analysis.
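As a minimal sketch of that inclusion step, assuming the trial data were tabulated one row per participant-case with hypothetical column names, the filtering might look like this:

```python
import pandas as pd

# Illustrative rows only; the column names and values are assumptions,
# not the study's actual schema.
cases = pd.DataFrame([
    {"participant": "P01", "case": "D1", "task": "diagnostic", "score": 72, "log_available": True},
    {"participant": "P01", "case": "D2", "task": "diagnostic", "score": 0,  "log_available": True},   # started, not completed
    {"participant": "P02", "case": "M1", "task": "management", "score": 65, "log_available": False},  # chat log inaccessible
    {"participant": "P03", "case": "M2", "task": "management", "score": 58, "log_available": True},
])

# Mirror the stated inclusion rule: an accessible chat log plus a score,
# excluding incomplete cases (a score of 0 marked non-completion).
analysed = cases[cases["log_available"] & (cases["score"] > 0)]
print(analysed)
```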


Across diagnostic and management tasks, all four input approaches appeared, but copy-pasting and searching were most common. Some participants completed a case without using the chatbot at all, more often in management cases: one or two participants did so on four of the six diagnostic cases, and two to five participants on each of the five management cases.


A linear mixed-effects model compared case scores across input types, accounting for clustering by participant and by case and including a term distinguishing diagnostic from management reasoning. The analysis found no single content-amount approach associated with higher scores for either diagnostic or management cases. This challenged the assumption that providing more of the vignette would, by itself, translate into better clinical reasoning performance when physicians used the chatbot.
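The paper reports the full model specification; purely as a rough sketch of a model of this shape, the snippet below fits crossed random intercepts for participant and case in Python with statsmodels, using simulated stand-in data and hypothetical column names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data; the real analysis used the trial case scores.
rng = np.random.default_rng(0)
n = 240
df = pd.DataFrame({
    "score": rng.normal(70, 10, n),
    "input_type": rng.choice(
        ["copy_all", "copy_selected", "summarise", "search"], n),
    "task": rng.choice(["diagnostic", "management"], n),
    "participant": rng.choice([f"P{i:02d}" for i in range(40)], n),
    "case": rng.choice([f"C{i}" for i in range(11)], n),
})

# statsmodels expresses crossed random effects as variance components
# inside a single dummy group spanning the whole dataset.
df["group"] = 1
model = smf.mixedlm(
    "score ~ C(input_type) + C(task)",  # fixed effects: input approach, task type
    data=df,
    groups="group",
    re_formula="0",                     # no random intercept for the dummy group itself
    vc_formula={"participant": "0 + C(participant)",
                "case": "0 + C(case)"},
)
print(model.fit().summary())
```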


Implications for Implementation and Training
The findings suggest that content amount alone does not explain performance differences in clinical reasoning tasks when physicians use LLM chatbots. While the amount of information entered can shape outputs, performance may depend more on how clinicians filter, interpret and incorporate outputs into their reasoning. Physicians also varied in how efficient and useful they found each approach: some struggled to craft queries that produced helpful responses, while others found outputs too long when seeking targeted information.


Management tasks raised additional concerns. Interviewed physicians generally disliked GPT-4 outputs for management cases, describing them as too broad or insufficiently sensitive to patient- and setting-specific nuance, and reported relying more on their own clinical experience for these tasks. This aligns with the observed tendency for the chatbot to be used less often in management vignettes.


The typology offers a practical lens for real-world support because it reflects how clinicians naturally interact with chatbots, not how prompts are optimised in experimental conditions. Training could be tailored to different problem-solving styles and tasks, rather than assuming a single best method of entering clinical information. Developers could also design interfaces that better match clinical workflows, whether by facilitating structured entry of complete vignettes, enabling rapid extraction of key details for selective pasting or supporting targeted, search-style queries. Although integration with the electronic health record (EHR) is increasing, external LLM applications are expected to remain common in the interim, reinforcing the need for clear guidance on appropriate tasks, limitations and efficient workflows.
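In that interim, the typical deployment is an external application wrapping a general chat API. As a minimal sketch, assuming the OpenAI Python client and an invented instruction, a tool supporting structured entry of a complete vignette might look like the following; the prompt wording is hypothetical, not the trials' protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical fixed instruction; a real tool would be clinically validated.
SYSTEM_PROMPT = (
    "You are assisting a physician with differential diagnosis. "
    "List the most likely diagnoses with brief supporting reasoning."
)

def differential_from_vignette(vignette_text: str) -> str:
    """Send a complete pasted vignette together with a fixed instruction."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": vignette_text},
        ],
    )
    return response.choices[0].message.content
```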


Physicians use multiple distinct approaches when entering clinical case information into LLM chatbots, ranging from pasting full vignettes to issuing short search-like queries. In the analysed diagnostic and management vignette tasks, none of these content-amount approaches was associated with higher clinical reasoning scores. Effective use is therefore unlikely to be achieved simply by encouraging more comprehensive data entry. Greater emphasis may be needed on purposeful training, prompt practices beyond content volume and support for clinicians’ interpretation and cognitive engagement when using chatbot outputs in clinical decision-making.


Source: npj Digital Medicine

Image Credit: iStock


References:

Siden R, Kerman H, Gallo RJ et al. (2025) A typology of physician input approaches to using AI chatbots for clinical decision-making. npj Digital Medicine, in press.


