Clinical AI scribes are being tested as healthcare services consider whether ambient voice tools can produce reliable clinical summaries across different types of speech. A controlled single-system simulation published in BMJ Health & Care Informatics assessed whether patient communication style, international English accents and speech impairments affected the accuracy of a clinical AI scribe. Testing took place in a simulated primary-care environment at the NHS England South-West Centre of Digital Excellence at the University of the West of England, Bristol, between January and September 2025. The system produced clinical summaries from simulated consultations and transcripts from speech-impairment recordings. Errors were grouped as omissions, factual inaccuracies or hallucinations. Performance was broadly stable across communication styles and most accents, but some speech characteristics created clear weaknesses.

 

Communication Style Has Limited Impact

The evaluation used four simulated primary-care scenarios: diarrhoea, sleep apnoea, work stress and headache. Each scenario was acted by a professionally trained actor and repeated across five communication styles based on the Big Five traits: openness, conscientiousness, extraversion, agreeableness and neuroticism. In total, there were 20 consultations, each lasting between seven and 19 minutes. The clinical AI scribe used its default settings and produced summaries in a SOAP format, covering Subjective, Objective, Assessment and Plan content.

 

Each generated summary was checked against a human-verified transcript. The total error count combined omissions, factual inaccuracies and hallucinations. All five communication styles produced some errors, but total errors did not differ significantly between them. Extraversion produced the highest median total error count at 3.5, mainly because of omissions. Conscientiousness and agreeableness had lower median total error counts, at 1.5 and 2.0 respectively. Openness and neuroticism showed more variation in error type and frequency, especially in emotionally complex consultations such as work stress, but the differences were not statistically significant.
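The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code: the record layout, the function name and every count below are invented placeholders, and the style and scenario labels are deliberately generic. It shows how a total error count is the sum of the three error types, and why a median taken over four consultations per style can land on a half value.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class SummaryErrors:
    """Error counts for one generated summary (hypothetical record layout)."""
    style: str
    scenario: str
    omissions: int
    factual_inaccuracies: int
    hallucinations: int

    @property
    def total(self) -> int:
        # Total error count = omissions + factual inaccuracies + hallucinations.
        return self.omissions + self.factual_inaccuracies + self.hallucinations


def median_total_by_style(records):
    """Group total error counts by communication style and take the median."""
    totals = defaultdict(list)
    for record in records:
        totals[record.style].append(record.total)
    return {style: median(values) for style, values in totals.items()}


# Invented placeholder counts, NOT results from the study.
example_records = [
    SummaryErrors("style_a", "scenario_1", 0, 0, 0),
    SummaryErrors("style_a", "scenario_2", 2, 0, 0),
    SummaryErrors("style_a", "scenario_3", 2, 1, 0),
    SummaryErrors("style_a", "scenario_4", 5, 1, 1),
    SummaryErrors("style_b", "scenario_1", 1, 0, 0),
    SummaryErrors("style_b", "scenario_2", 1, 0, 0),
    SummaryErrors("style_b", "scenario_3", 0, 1, 0),
    SummaryErrors("style_b", "scenario_4", 3, 1, 0),
]

if __name__ == "__main__":
    print(median_total_by_style(example_records))
```

With an even number of consultations per style, the median is the average of the two middle totals, which is how a value such as 3.5 or 1.5 can arise from whole-number error counts.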

 

Accents Mainly Lead to Omissions

Accent testing kept the consultation wording the same while changing the speaker accent. Verbatim transcripts were generated, manually corrected and converted into turn-by-turn dialogue scripts. Synthetic speech was then generated using intelligible accented voices, with a neutral synthetic voice used as the baseline. Each scenario had two versions: an accented patient with a baseline synthetic doctor and an accented doctor with a baseline synthetic patient. Human-accented patient speech was also recorded for Irish, Chinese and Nigerian English accents.

The accent dataset covered five clinical scenarios: diarrhoea, headache, prostate, skin rash and sleep apnoea. It included American, Chinese, Indian, Irish and Scottish synthetic voices, a synthetic baseline and additional human-accented consultations, for a total of 70 consultations. Error rates did not differ significantly by patient accent or doctor accent. Adding an accent to either speaker did not significantly change total error counts, and it made no material difference whether the accented speaker was the clinician or the patient.

 

Most accent-related errors were omissions. Factual inaccuracies and hallucinations stayed low across accent conditions, with medians at or near zero. Scottish-accented patients showed slightly higher median omissions, while Chinese-accented and Indian-accented doctors had the highest median omissions. The American-accented doctor voice showed wider variation, with occasional higher omission counts. These differences were not statistically significant. Total errors were similar for human and synthetic patient accents, supporting cautious use of high-quality synthetic voices for accent testing when balanced human speech collections are unavailable.

 

Speech Impairments Reveal Greater Risk

Speech-impairment testing used publicly available recordings covering five profiles: phonological impairment, vowel disorder, childhood apraxia of speech, articulation disorder and cleft palate. These recordings were not medical conversations, so the clinical AI scribe did not generate summaries. Instead, it attempted to create transcripts, which were checked against human-confirmed transcripts.

 

The main measure was the percentage of words recognised correctly. Performance varied strongly by impairment type. Cleft-palate recordings had the highest recognition accuracy and did not differ significantly from perfect recognition. Vowel-disorder recordings showed wider variation but also did not differ significantly from perfect recognition. Phonological impairment led to a significant fall in accuracy. The system had marked difficulty extracting information from affected speech. Articulation deficits also caused substantial transcription errors.
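One way to compute a "percentage of words recognised correctly" is sketched below. The study's exact scoring procedure is not described here, so this is an assumption-laden approximation: the tokenisation rule and the function names are hypothetical, and the alignment uses Python's standard difflib.SequenceMatcher rather than a formal word-error-rate tool. The sample sentences are invented.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and split into words, ignoring punctuation (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_recognition_accuracy(reference, hypothesis):
    """Percentage of reference words that also appear, in order, in the
    system transcript, based on a difflib alignment of the two word lists."""
    ref_words = tokenize(reference)
    hyp_words = tokenize(hypothesis)
    if not ref_words:
        raise ValueError("reference transcript is empty")
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    # Sum the lengths of all aligned runs of identical words.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(ref_words)


if __name__ == "__main__":
    # Invented sentences: one word of six is misrecognised.
    human_confirmed = "the cat sat on the mat"
    system_output = "the cat sat on a mat"
    print(round(word_recognition_accuracy(human_confirmed, system_output), 1))
```

A measure of this kind counts only matched words against the human-confirmed transcript, so extra words inserted by the system do not lower the score. A fuller evaluation would normally report insertions separately.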

 

The transcript-level analysis of accented speech showed several recurring error types. These included confusion around short function words, incorrect word boundaries, voicing errors in weak positions and problems with final sounds that affected negation. Rare words and names were sometimes replaced with other words, and possessive forms were sometimes confused with plurals. Overall, accent-related problems mainly appeared as missing information rather than fabricated or wrong clinical content. By comparison, phonological impairment created a more serious transcription problem.

 

Under controlled conditions, the clinical AI scribe showed no statistically significant difference in total errors across simulated communication styles or international English accents. Most errors were omissions, while hallucinations and factual inaccuracies remained low. Speech impairment produced a less consistent result, with good recognition for cleft palate and vowel disorders but much weaker accuracy for phonological impairment. Safe use depends on clinician checks, subgroup performance monitoring and clear switch-off criteria for high-risk speech profiles. Local validation, audit and attention to safety-critical fields remain essential.

 

Source: BMJ Health & Care Informatics


References:

Draper TC, Leake J, Cox T et al. (2026) AI-generated clinical summaries: errors and susceptibility to speech and speaker variability. BMJ Health & Care Informatics;33:e101918.



