Large language models are taking on a growing role in healthcare, with applications spanning clinical decision support, diagnostic assistance, administrative automation, electronic health record summarisation, patient interaction systems and automated medical coding. Their ability to generate fluent text and process large volumes of clinical language has created clear operational value, but that value is closely tied to questions of safety, privacy and reliability. These concerns matter because unsafe outputs can affect patient care, while exposure of protected health information can undermine confidentiality and create legal and ethical risk. A recent study examined the privacy and safety risks linked to these systems, focusing on memorisation of sensitive information, inference failures triggered by prompt design and retrieval hazards in systems that draw on external knowledge sources. Its structured assessment used synthetic benchmarks, de-identified clinical corpora and controlled prompt experiments to compare general-purpose and healthcare-tuned models, examining how these risks appear in practice and how they differ across model types.

 

Memorisation Risks and Model Choice

Memorisation emerged as a measurable concern across all evaluated models, though its extent varied considerably. Exact match testing showed that GPT-4 had the highest rate of repeated content from the training corpus at 2.4%, while MedPaLM recorded 1.0%, ClinicalBERT 1.4% and BioBERT 1.2%. This pattern suggests that healthcare-specific fine-tuning can reduce the tendency to reproduce previously seen material, even if it does not remove the risk entirely. In environments where confidential records, clinical notes and other sensitive information may shape model training, even relatively low levels of repetition remain significant.
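The article does not reproduce the study's test harness, but the logic of exact match testing can be sketched in a few lines: model outputs are scanned for verbatim word n-grams that also appear in a reference corpus, and the flagged fraction becomes the memorisation rate. The n-gram window and helper names below are illustrative assumptions, not the study's actual implementation.

```python
import re

def ngram_set(text: str, n: int = 8) -> set:
    """Lowercase word n-grams of a text (n is an assumed window size)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def exact_match_rate(outputs: list, corpus: list, n: int = 8) -> float:
    """Fraction of model outputs containing at least one verbatim n-gram
    from the reference corpus; a simple proxy for memorisation."""
    corpus_ngrams = set()
    for doc in corpus:
        corpus_ngrams |= ngram_set(doc, n)
    flagged = sum(1 for out in outputs if ngram_set(out, n) & corpus_ngrams)
    return flagged / len(outputs) if outputs else 0.0
```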

 

Prompt complexity also influenced memorisation. More complex prompts, defined by ambiguity, multiple pieces of information or technical medical terms, were associated with higher memorisation rates. GPT-4 showed the strongest correlation between complexity and memorisation, with a Pearson's r of 0.63, followed by BioBERT at 0.55, MedPaLM at 0.51 and ClinicalBERT at 0.48. The pattern indicates that more sophisticated queries can increase the likelihood of reproducing memorised material, particularly in larger general-purpose systems.
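As a minimal sketch of how such a correlation can be computed, assuming each evaluation prompt carries a numeric complexity score and a binary memorisation flag (both scoring schemes are assumptions for illustration, not the study's published method), Pearson's r follows directly from the paired values:

```python
import numpy as np

# Hypothetical per-prompt records: a complexity score (e.g. counting clinical
# terms, entities and ambiguity cues) and a 0/1 flag for memorised content.
complexity = np.array([1.0, 2.5, 3.0, 4.5, 2.0, 5.0])
memorised = np.array([0, 0, 1, 1, 0, 1])

# Pearson's r between prompt complexity and memorisation outcome,
# analogous to the per-model correlations reported above.
r = np.corrcoef(complexity, memorised)[0, 1]
print(f"Pearson's r = {r:.2f}")
```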

 

The comparative model profile reinforces the importance of model selection in healthcare settings. GPT-4, trained on a broader and more diverse corpus, was more prone to exact repetition. MedPaLM, ClinicalBERT and BioBERT all performed better in this respect, reflecting the effect of domain-specific training on medical, clinical and biomedical data. In practical terms, the findings favour specialised models where privacy and controlled generalisation are priorities. They also support the use of anonymised and de-identified data, regular monitoring and structured governance to keep memorisation risk within acceptable bounds.

 


Prompt Inference Errors in Clinical Contexts

Inference risk was examined through controlled prompt experiments using direct questions, contextual questions and open-ended scenarios. The findings show that prompt design has a strong effect on output safety. Direct questions produced the lowest error rate at 5.2%. Contextual questions, which included patient history or additional clinical detail, produced a 9.8% error rate. Open-ended scenarios generated the highest error rate at 14.5%, reflecting the greater difficulty of producing accurate responses when prompts are broad and less constrained.
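One way to picture how such rates are tabulated, assuming each reviewed response is labelled with its prompt type and an error judgement (the record format here is an assumption for illustration), is to group responses by prompt type and divide errors by totals:

```python
from collections import defaultdict

# Hypothetical review records: (prompt_type, is_error) pairs from expert
# assessment of model responses.
results = [
    ("direct", False), ("direct", False), ("contextual", True),
    ("contextual", False), ("open_ended", True), ("open_ended", False),
]

counts = defaultdict(lambda: [0, 0])  # prompt_type -> [errors, total]
for prompt_type, is_error in results:
    counts[prompt_type][0] += int(is_error)
    counts[prompt_type][1] += 1

for prompt_type, (errors, total) in counts.items():
    print(f"{prompt_type}: {100 * errors / total:.1f}% error rate")
```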

 

Three main error types were identified. Semantic drift, rated at severity 2 to 3, involved responses that moved away from the intended meaning of the prompt. Logical inconsistency, rated at severity 4 to 5, referred to outputs that conflicted with established medical knowledge or guidelines. Hallucination, rated at severity 3 to 5, described fabricated claims or unsupported information. These categories show that not all inference failures carry the same weight. Some may create confusion, while others can directly shape unsafe decisions.
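The severity bands can also be encoded directly, which is useful when review tooling needs to triage findings. A minimal sketch follows, assuming the taxonomy structure and the review threshold are implementation choices rather than anything specified in the study:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ErrorCategory:
    name: str
    min_severity: int
    max_severity: int

# Severity bands reported above, keyed by error type.
CATEGORIES = {
    "semantic_drift": ErrorCategory("semantic_drift", 2, 3),
    "logical_inconsistency": ErrorCategory("logical_inconsistency", 4, 5),
    "hallucination": ErrorCategory("hallucination", 3, 5),
}

def needs_urgent_review(category: str, severity: int, threshold: int = 4) -> bool:
    """Flag any finding rated at or above the review threshold."""
    band = CATEGORIES[category]
    if not band.min_severity <= severity <= band.max_severity:
        raise ValueError(f"severity {severity} outside band for {category}")
    return severity >= threshold
```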

 

The case examples underline the seriousness of those failures. In one example, a prompt about a patient with asthma and COPD experiencing shortness of breath produced a recommendation to increase intake of oxygen-rich foods, an answer classified as logically inconsistent. In another, a prompt about the latest treatment for hypertension generated a claim about a permanent gene therapy cure, a hallucinated response with clear potential to mislead. The overall pattern indicates that ambiguity and openness increase the chance of harmful output, while structured and context-rich prompting can improve reliability, particularly in specialised healthcare models such as MedPaLM and ClinicalBERT. Prompt engineering, audit routines and human oversight therefore emerge as central controls rather than optional refinements.

 

Retrieval Hazards and Safe Integration

Retrieval-augmented generation introduces another layer of risk by linking model outputs to indexed external data. The main concern is that relevant retrieval can coexist with privacy leakage. Sensitive phrase recall rates show this clearly. GPT-4 retrieved sensitive phrases in 3.0% of retrievals, compared with 1.6% for MedPaLM, 1.2% for ClinicalBERT and 1.4% for BioBERT. Under the study’s privacy threshold, retrieval accuracy needed to remain above 90% and sensitive phrase recall at or below 2%. GPT-4 was classified as non-compliant because it exceeded the sensitive phrase threshold, while the other three models remained compliant.

 

Retrieval accuracy itself was high across all models. GPT-4 reached 90%, MedPaLM 95%, ClinicalBERT 94% and BioBERT 96%. Yet the findings make clear that high retrieval accuracy alone does not guarantee safe performance. Even limited retrieval misalignment can expose irrelevant or sensitive material, and that material may then shape the generated response. In healthcare, where patient identifiers, treatment histories and protected health information may be present in connected sources, that risk is operational as well as technical.
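Putting the two metrics together, the study's compliance rule reduces to a simple conjunction of thresholds. A minimal sketch using the figures quoted above, assuming the 90% accuracy bound is applied inclusively (GPT-4's non-compliance in any case turns on the sensitive phrase ceiling):

```python
def rag_privacy_compliant(retrieval_accuracy: float,
                          sensitive_phrase_recall: float,
                          accuracy_floor: float = 0.90,
                          recall_ceiling: float = 0.02) -> bool:
    """Study threshold: retrieval accuracy at or above the floor and
    sensitive phrase recall at or below the ceiling."""
    return (retrieval_accuracy >= accuracy_floor
            and sensitive_phrase_recall <= recall_ceiling)

# Figures reported in the article.
models = {
    "GPT-4": (0.90, 0.030),
    "MedPaLM": (0.95, 0.016),
    "ClinicalBERT": (0.94, 0.012),
    "BioBERT": (0.96, 0.014),
}

for name, (accuracy, recall) in models.items():
    status = "compliant" if rag_privacy_compliant(accuracy, recall) else "non-compliant"
    print(f"{name}: {status}")
```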

 

The study points towards a practical route for safer deployment. Domain-specific fine-tuning, de-identified training data, secure indexing, query filtering, access controls and privacy-preserving techniques such as anonymisation and differential privacy all help reduce exposure. Human-in-the-loop validation adds a further safeguard when outputs may influence diagnosis or treatment planning. A risk assessment framework built around memorisation, inference and retrieval checks can align these controls with existing clinical risk management processes and support more cautious integration into real-world workflows.
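Query filtering is one of the more concrete of these controls and can be illustrated with a short sketch. The patterns below are illustrative assumptions only; a deployed system would rely on a validated de-identification pipeline rather than hand-written rules:

```python
import re

# Illustrative identifier patterns only; not a complete definition of
# protected health information.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # SSN-style identifiers
    re.compile(r"\b\d{10}\b"),              # bare 10-digit numbers (e.g. MRN or phone)
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),   # ISO-format dates
]

def filter_query(query: str, redaction: str = "[REDACTED]") -> str:
    """Redact obvious identifiers before a query reaches the retrieval
    index, as one layer among the controls described above."""
    for pattern in PHI_PATTERNS:
        query = pattern.sub(redaction, query)
    return query

print(filter_query("Summarise labs for MRN 1234567890 seen on 2024-03-15"))
# -> Summarise labs for MRN [REDACTED] seen on [REDACTED]
```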

 

Healthcare use of large language models brings measurable gains in language processing and workflow support, but those gains sit alongside identifiable privacy and safety risks. Memorisation rates were higher in the general-purpose model, inference errors increased with prompt complexity and openness, and retrieval systems introduced a distinct risk of sensitive phrase reconstruction even when overall retrieval accuracy remained high. Across memorisation, inference and retrieval, domain-specific fine-tuning improved performance and reduced exposure. Safe deployment therefore depends less on model capability alone than on disciplined implementation: careful model selection, de-identified data practices, robust prompt standards, retrieval safeguards, continuous monitoring and clinical oversight. These measures define a more realistic path for integrating large language models into healthcare without losing sight of patient confidentiality, clinical reliability and operational accountability.

 

Source: International Journal of Innovative Science and Research Technology

Image Credit: iStock

 


References:

 Akande OA, Ijiga OM, Bamigwojo OV et al. (2026) Assessment of Memorization, Prompt Inference, and Retrieval Risks in Healthcare Large Language Models. International Journal of Innovative Science and Research Technology, 2887–2916.


