The use of artificial intelligence (AI), particularly large language models (LLMs), is gaining traction in healthcare. While these technologies have primarily supported clinicians by helping to manage workload and reduce burnout, their potential to enhance patient understanding remains underexplored. Following the 21st Century Cures Act and the rise of open notes, patients now have unprecedented access to their clinical documentation. However, this access often comes with the challenge of interpreting complex medical language. A proof-of-concept study sought to determine whether LLMs could effectively assist patients in making sense of open visit notes, thereby supporting informed decision-making and engagement in their care.

 

Patient-Centric Design and Evaluation Framework 

The study involved three widely accessible LLMs—ChatGPT 4o, Claude 3 Opus and Gemini 1.5—evaluated through patient-generated questions based on a real neuro-oncology progress note. A cross-sectional design was employed, featuring four distinct prompt styles: Standard Order, Randomised Order, Persona and Randomised Persona. The inclusion of “Persona” prompts, where the LLM is instructed to answer as a specialist, was particularly noteworthy. Responses were scored independently by both the note’s author, a neuro-oncologist, and the patient to whom the note referred. An eight-criterion rubric assessed Accuracy, Relevance, Clarity, Actionability, Empathy/Tone, Completeness, Evidence and Consistency.
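To illustrate how Standard and Persona prompt styles might be constructed in practice, the minimal sketch below builds both variants of a patient question and sends them to an LLM. The prompt wording, the sample question, the placeholder note text and the use of the OpenAI Python client with a "gpt-4o" model identifier are illustrative assumptions, not the study's actual materials or tooling.

```python
# Illustrative sketch only: prompt wording, question text and model identifier
# are assumptions for demonstration, not the prompts used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

NOTE_TEXT = "..."  # the open visit note shared by the patient (placeholder)
QUESTION = "What does this note mean for my next treatment steps?"

def standard_prompt(note: str, question: str) -> str:
    # Standard Order: the note first, then the patient's question.
    return f"Here is my clinician's visit note:\n{note}\n\nMy question: {question}"

def persona_prompt(note: str, question: str) -> str:
    # Persona: the model is first instructed to answer as a specialist.
    return (
        "Answer as a neuro-oncologist explaining a visit note to your patient "
        "in plain, empathetic language.\n\n"
        f"Visit note:\n{note}\n\nPatient question: {question}"
    )

for label, prompt in [("Standard", standard_prompt(NOTE_TEXT, QUESTION)),
                      ("Persona", persona_prompt(NOTE_TEXT, QUESTION))]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    print(label, "->", response.choices[0].message.content[:200])
```

The Randomised variants used in the study would reorder the same elements; the sketch keeps only the two contrasting styles for brevity.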

 

To ensure objectivity, evaluators were blinded to the identity of the LLM and prompt style. The analysis was methodologically rigorous, using descriptive statistics, ANOVA and Kruskal-Wallis tests to compare model performance. Inter-rater reliability was also calculated, revealing notable contrasts between the perspectives of the clinician and the patient.
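As a rough illustration of how such comparisons could be run, the sketch below applies a one-way ANOVA, a Kruskal-Wallis test and Cohen's kappa to mock rubric scores. The data values, group sizes and score ranges are invented for demonstration and do not reflect the study's dataset or its exact reliability statistic.

```python
# Mock rubric scores (1-5) for three models; all values are invented.
import numpy as np
from scipy.stats import f_oneway, kruskal
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
scores = {
    "ChatGPT 4o": rng.integers(3, 6, size=32),
    "Claude 3 Opus": rng.integers(3, 6, size=32),
    "Gemini 1.5": rng.integers(2, 5, size=32),
}

# Parametric and non-parametric comparisons of scores across models.
f_stat, p_anova = f_oneway(*scores.values())
h_stat, p_kw = kruskal(*scores.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.3f}")
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_kw:.3f}")

# Inter-rater agreement between the clinician's and the patient's ratings
# of the same responses (again, mock data).
clinician = rng.integers(3, 6, size=32)
patient = np.clip(clinician + rng.integers(-1, 2, size=32), 1, 5)
print(f"Cohen's kappa: {cohen_kappa_score(clinician, patient):.2f}")
```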

 


Insights into Model Performance and Prompt Impact 

The study demonstrated that all models provided usable responses, though performance varied by model and prompt type. Persona-style prompts consistently enhanced the quality of LLM responses, particularly in Clarity and Empathy/Tone. ChatGPT 4o using the Persona prompt scored highest overall, followed closely by Claude 3 Opus. Gemini 1.5 showed more variability and lagged in metrics such as Relevance and Evidence.

 

Despite general alignment between patient and clinician ratings for Accuracy and Actionability, discrepancies appeared in subjective measures like Empathy/Tone and Evidence. Patients tended to rate Clarity and Completeness lower than clinicians, reflecting different expectations and interpretations. Notably, the use of evidence within responses was weak across all models, suggesting a gap between AI-generated content and evidence-based medical guidance.

 

The prompt format was a key determinant of success. The Standard and Persona formats outperformed their Randomised counterparts, indicating that both sequencing and contextual framing significantly influence output quality. The interaction between model and prompt style was statistically significant, reinforcing the need for tailored approaches to prompting when designing AI tools for patient use.
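One way to probe such a model-by-prompt interaction is a two-way ANOVA on the rubric scores. The statsmodels sketch below uses fabricated scores purely to show the form of the test; the group sizes, score distribution and column names are assumptions, not the study's actual analysis.

```python
# Two-way ANOVA with a model x prompt-style interaction term, on mock data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
models = ["ChatGPT 4o", "Claude 3 Opus", "Gemini 1.5"]
prompts = ["Standard", "Randomised", "Persona", "Randomised Persona"]
rows = [
    {"model": m, "prompt": p, "score": rng.normal(4.0, 0.5)}
    for m in models for p in prompts for _ in range(8)
]
df = pd.DataFrame(rows)

# C() treats the columns as categorical; '*' adds main effects plus interaction.
fit = smf.ols("score ~ C(model) * C(prompt)", data=df).fit()
print(anova_lm(fit, typ=2))
```

A significant interaction term in such a table would indicate, as the study reported, that the best prompt style depends on which model is being used.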

 

Implications for Patient Engagement and Future Healthcare Models 

The findings underscore the potential of LLMs as support tools for patients navigating clinical information. As healthcare shifts toward greater transparency and shared decision-making, technologies that demystify clinical content can empower patients. However, effective deployment will require guidance in crafting prompts, education on interpreting AI-generated responses and awareness of model limitations.

 

Rather than discouraging patient use of AI, healthcare systems could proactively support it, offering suggestions for effective prompting and clarifying the appropriate use of LLMs. This cultural shift could help patients manage their care more confidently and reduce cognitive burden, particularly during the often-overlooked “in-between” moments outside clinical encounters.

 

Importantly, the variability across models calls for clinician awareness of these tools’ strengths and weaknesses. While clinicians are not expected to master AI, they should understand that LLMs, like other medical technologies, are not all the same. This recognition can guide more constructive conversations with patients who rely on such tools.

 

The study also highlighted the need to refine evaluation metrics based on patient goals. For example, a patient preparing for a follow-up appointment may prioritise Actionability and Relevance over Empathy or Evidence. Future studies should explore model performance with memory-enabled features, which could personalise responses over time and improve contextual accuracy. 

 

This proof-of-concept study offers a foundational understanding of how LLMs can assist patients in interpreting open clinical notes. Incorporating a Persona instruction significantly improved performance across key metrics, suggesting that patient outcomes could benefit from prompt refinement and user training. While all three models showed potential, differences in reliability and user experience point to the need for continued evaluation. As AI technologies evolve, the healthcare sector must ensure patients are equipped not only with access to information but also with the tools and knowledge to interpret it meaningfully. Collaboration between patients, clinicians and technologists will be essential to fully harness the promise of AI in patient-centred care. 

 

Source: JAMIA Open 



References:

Salmi L, Lewis DM, Clarke JL et al. (2025) A proof-of-concept study for patient use of open notes with large language models. JAMIA Open, 8(2):ooaf021. 


