Large language models are increasingly used to support clinical documentation, including transcripts of spoken doctor-patient encounters. In a simulated setting, Google’s NotebookLM generated speaker-labelled transcripts from a set of standardised-patient interactions, which were then compared with manually prepared transcripts. The comparison examined word errors, speaker-attribution problems and the effects of overlapping speech, audio fidelity and medical terminology. Overall error rates were modest, but semantic inaccuracies, speaker confusion and terminology mistakes remained frequent enough to raise concerns about fully autonomous use in clinical documentation.
Errors Were Modest but Not Minor
NotebookLM produced a relatively low overall transcription error rate across the simulated encounters, but the pattern of those errors is more important than the topline figure alone. Deletions were the most common problem, while substitutions were less frequent and insertions were comparatively rare. In practice, the system was more likely to miss spoken words than to invent new ones. Semantic errors also appeared regularly, meaning that some inaccuracies changed meaning rather than simply reducing fluency.
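The deletion/substitution/insertion taxonomy described above is the standard way word-level transcription errors are classified, typically via an edit-distance alignment between a reference transcript and the model's output. The sketch below is a minimal illustration of that idea; the function name and scoring details are this article's own illustration, not the study's actual evaluation code.

```python
def word_errors(reference, hypothesis):
    """Classify word-level errors between a reference transcript and a
    hypothesis using edit distance, then derive a word error rate.
    Illustrative only: the study's own scoring pipeline may differ."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1)          # insertion
    # Backtrace through the table to classify each edit operation.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1          # exact match, no error
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1; i -= 1            # a spoken word missing from the output
        else:
            ins += 1; j -= 1             # a word the model added
    wer = (subs + dels + ins) / max(len(ref), 1)
    return subs, dels, ins, wer

# One deletion ("two") against a six-word reference.
print(word_errors("chest pain started two days ago",
                  "chest pain started days ago"))
```

Note that this purely lexical count cannot distinguish a harmless disfluency from a meaning-altering mistake, which is why the study reports semantic errors separately.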
Speaker-attribution problems added a second risk. Some speaking turns were assigned to the wrong person, and some correctly transcribed words were attached to the wrong speaker. In a clinical setting, that distinction matters because what a patient says and what a clinician says carry different weight in history taking, assessment and documentation.
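The second failure mode above — a correctly transcribed word landing in the wrong speaker's turn — can be counted separately from word errors once the two transcripts are word-aligned. The toy function below assumes a pre-aligned sequence of (speaker, word) pairs for simplicity; real diarisation scoring first aligns turns, and nothing here reflects the study's actual method.

```python
def misattributed_words(reference, hypothesis):
    """Count words that were transcribed correctly but assigned to the
    wrong speaker, given word-aligned (speaker, word) pairs.
    A toy illustration; real diarisation scoring aligns turns first."""
    wrong = 0
    for (ref_spk, ref_word), (hyp_spk, hyp_word) in zip(reference, hypothesis):
        if ref_word == hyp_word and ref_spk != hyp_spk:
            wrong += 1
    return wrong

reference = [("patient", "the"), ("patient", "pain"), ("doctor", "when")]
hypothesis = [("patient", "the"), ("doctor", "pain"), ("doctor", "when")]
# "pain" is transcribed correctly but attributed to the doctor.
print(misattributed_words(reference, hypothesis))
```

Separating this count from ordinary word errors matters clinically: a symptom report credited to the clinician, or a clinical opinion credited to the patient, can distort the record even when every word is transcribed accurately.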
Performance also varied across the encounters. Some standardised-patient interactions produced fewer errors, while others generated more transcription and turn-taking problems. That variation shows that accuracy was not stable even under controlled, simulated conditions with the same transcription approach throughout.
Patient Speech and Overlap Were Harder to Handle
Errors were not distributed evenly across speakers. Medical students spoke more than the standardised patients, but patient speech showed a higher rate of transcription failure. Deletions were especially common on the patient side of the encounter. That imbalance is notable because patient language often contains the symptoms, concerns and contextual details that shape documentation.
Conversation flow also influenced performance. Encounters with more speaking turns tended to produce more transcription and speaker-attribution errors. Overlapping speech was a particularly important source of difficulty. It accounted for a meaningful share of both total word errors and mis-attributed speaker words, showing that interruptions or simultaneous speech could quickly reduce accuracy. These are common features of real clinical conversations, especially when clarification, reassurance or follow-up questions interrupt a turn.
Medical terminology created another persistent challenge. Errors affected medication names, conditions, symptoms, anatomy, examination findings and diagnostic tests. These mistakes made up a notable share of the total error burden and an even larger share of semantic errors. A prompt change designed to help the model anticipate expected terminology did not improve performance.
Better Audio Improved Results but Did Not Eliminate Risk
A second part of the work tested whether recording quality explained much of the problem. Some encounters were re-recorded under better audio conditions, in a quieter environment, with different equipment and without overlapping speech. Under those conditions, transcription and speaker-attribution errors fell significantly. The biggest improvement came from a reduction in deletions, while semantic and turn-taking errors also declined.
Even so, better audio did not solve the underlying problem. Errors remained despite the cleaner recording conditions, including mistakes affecting meaning, terminology and speaker assignment. That distinction matters because it suggests two separate pressures on performance. Some errors are linked to modifiable factors such as noise, overlap and recording quality. Others persist even when those conditions improve.
For clinical use, that limits how far technical refinements alone can reduce risk. In this setting, the transcript was not a minor administrative output but the basis for clinical documentation. Even a modest number of meaning-altering inaccuracies can therefore matter.
In these simulated doctor-patient encounters, NotebookLM delivered modest overall error rates, but clinically important inaccuracies remained. Deletions were the most common error type, patient speech was more vulnerable to error than student speech, and overlapping talk, medical terminology and audio fidelity all shaped performance. Better recording conditions improved results substantially but did not remove the core risks linked to meaning and speaker attribution. The findings support a cautious approach to fully autonomous note generation and point instead to a more supportive role for large language models within clinician-led documentation workflows.
Source: BMC Medical Informatics & Decision Making
Image Credit: iStock
References:
Nolan EJ & Burke HB (2026) Accuracy of large language model transcription of simulated physician-patient verbal interactions. BMC Med Inform Decis Mak: In Press.