Radiology reports are traditionally written for clinical communication and often contain technical terminology that is difficult for patients to interpret independently. Large language models (LLMs) are increasingly explored as tools to transform these reports into more accessible explanations while preserving medical accuracy. At the same time, privacy and data governance remain central concerns when patient information is processed using external systems. A comparative evaluation of one commercial closed-weight model and two locally deployed open-weight models examined how effectively radiology reports could be simplified for lay readers across several imaging modalities. The findings demonstrate clear improvements in readability and perceived understanding, while also identifying differences in error patterns that are relevant for safe patient communication in clinical environments.

 

Model Deployment and Structured Simplification

The evaluation used fictional radiology reports written in German by radiologists and based on realistic clinical scenarios. Reports included key clinical sections such as indication and impression and represented four imaging modalities: computed tomography, magnetic resonance imaging, X-ray and ultrasound. A structured prompt guided the models to produce simplified explanations organised into several components, including examination context, findings and a plain-language explanation of medical terminology.
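The structured prompting described above can be sketched as a simple template. The study does not publish its exact prompt wording, so the section names and phrasing below are illustrative assumptions only, showing how a prompt might enforce a consistent explanatory format.

```python
# Hypothetical structured simplification prompt; section names and wording
# are illustrative assumptions, not the prompt used in the study.

SECTIONS = [
    "Examination context",
    "Findings in plain language",
    "Explanation of medical terminology",
]

def build_prompt(report_text: str) -> str:
    """Assemble a structured simplification prompt for one report."""
    section_list = "\n".join(f"- {s}" for s in SECTIONS)
    return (
        "You are assisting a patient with no medical background.\n"
        "Rewrite the radiology report below in simple language, "
        "preserving all diagnostic information.\n"
        "Organise your answer into these sections:\n"
        f"{section_list}\n\n"
        f"Report:\n{report_text}"
    )

prompt = build_prompt("CT Thorax: Kein Nachweis pulmonaler Rundherde.")
```

Fixing the section list in the prompt, rather than leaving the structure to the model, is one plausible way to obtain the consistent output format reported in the evaluation.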


Three language models were assessed: GPT-4o as a closed-weight model, alongside Llama-3-70b and Mixtral-8x22B as open-weight alternatives. The open-weight models were deployed locally on hospital-controlled infrastructure using standard model weights without additional fine-tuning. Each report was processed multiple times with the same prompt so that output consistency could be assessed. In total, simplified versions were generated for all reports across the three systems, enabling comparison of readability, user perception and safety-related errors.
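A minimal sketch of the repeated-generation setup might look as follows. The endpoint URL, model tags and repetition count are assumptions for illustration; the study does not specify its serving stack, and local open-weight deployments commonly expose an OpenAI-compatible chat API.

```python
import json

# Sketch of repeated generation against a locally hosted open-weight model.
# Endpoint, model tags and N_RUNS are hypothetical; the point is that every
# repetition uses an identical prompt and identical sampling settings.

LOCAL_ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed
MODELS = ["llama-3-70b", "mixtral-8x22b"]  # locally deployed open weights
N_RUNS = 5  # each report processed multiple times with the same prompt

def make_requests(prompt: str, model: str, n_runs: int = N_RUNS) -> list[str]:
    """Build one JSON request body per repetition, identical except run id."""
    bodies = []
    for run in range(n_runs):
        bodies.append(json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,
            "metadata": {"run": run},
        }))
    return bodies
```

Keeping the prompt and sampling parameters fixed across runs means any variation between outputs reflects the model itself, which is what a consistency comparison needs to isolate.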

 

The locally deployed configuration demonstrated that open-weight models can operate within hospital infrastructure, supporting privacy-preserving workflows. At the same time, the structured prompting approach helped ensure that simplified reports followed a consistent explanatory format and preserved the original diagnostic meaning.

 

Readability Improvements Across Generated Reports

Simplified reports produced by all three models were substantially longer than the original radiology texts, reflecting the addition of explanatory language. Reading time increased from only a few seconds for original reports to roughly one minute for simplified versions. Word counts rose from fewer than 100 words in original reports to roughly 250 words in generated summaries. Sentence counts followed a similar pattern, with simplified outputs containing noticeably more sentences than clinician-oriented originals.

 

Despite this increase in length, readability improved significantly. Using a German-adapted Flesch reading ease measure, original reports scored in a very low readability range, while simplified outputs from all three models reached scores in the mid-forties. These values indicate a shift from highly technical clinical language toward text that is still complex but considerably more accessible for non-specialist readers. Differences between the models in readability scores were small and not statistically significant, suggesting that both closed-weight and open-weight approaches can achieve comparable clarity when guided by structured prompts.
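The German-adapted Flesch reading ease referred to above is the Amstad formula, FRE_de = 180 − ASL − 58.5 × ASW, where ASL is the average sentence length in words and ASW the average number of syllables per word. The sketch below uses a rough vowel-group heuristic for syllable counting, not the exact tool used in the study.

```python
import re

# German-adapted Flesch reading ease (Amstad):
#   FRE_de = 180 - ASL - 58.5 * ASW
# ASL = average words per sentence, ASW = average syllables per word.
# Syllables are approximated by counting vowel groups (incl. umlauts);
# this heuristic is an assumption, not the study's measurement tool.

VOWEL_GROUPS = re.compile(r"[aeiouyäöü]+", re.IGNORECASE)

def count_syllables(word: str) -> int:
    return max(1, len(VOWEL_GROUPS.findall(word)))

def flesch_de(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    asl = len(words) / max(1, len(sentences))
    asw = sum(count_syllables(w) for w in words) / max(1, len(words))
    return 180 - asl - 58.5 * asw
```

Lower scores mean harder text, which is why dense clinical wording with long compound terms scores far below the simplified patient-facing versions.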

 

Overall, the results indicate that simplification requires additional explanatory content, which increases report length but improves comprehension. The transformation reflects a trade-off between brevity and accessibility that is central to patient-oriented communication.

 

Layperson Understanding and Safety Considerations

Layperson evaluation formed an important part of the comparison. More than 20 participants assessed original and simplified reports on a five-point scale measuring general understandability. Participants reviewed multiple report sets presented in randomised order to reduce bias. Completing the evaluation took approximately one hour on average.

 

Ratings showed a strong improvement in perceived understanding for all simplified reports compared with original radiology texts. While original reports received very low scores, simplified outputs from all three models were rated above four on average. Differences between imaging modalities were not statistically significant, indicating consistent performance across CT, MRI, X-ray and ultrasound reports. Inter-rater agreement was substantial, supporting the reliability of the evaluation results.

 

An expert error review by radiologists identified differences in safety-relevant performance among the models. Critical errors with the potential to mislead patients were observed only in the open-weight models. Mixtral-8x22B showed the highest number of such errors, appearing in several reports, while Llama-3-70b produced fewer but still measurable critical inaccuracies. The closed-weight model produced no critical errors in the reviewed outputs. Minor inaccuracies without direct clinical risk were rare across all systems.

 

These findings highlight an important distinction between readability gains and clinical safety. Although open-weight models produced explanations that were similarly understandable to those generated by the closed-weight system, their higher rate of critical errors underscores the importance of clinical oversight when simplified reports are shared with patients.

 

Simplified radiology reports generated by large language models can significantly improve patient-level understanding across multiple imaging modalities. Both closed-weight and locally deployed open-weight models demonstrated the ability to transform technical radiology language into clearer explanatory text, with comparable readability outcomes and strong layperson ratings. However, differences in safety-related errors emphasise the need for verification by healthcare professionals before patient-facing use. Open-weight models offer advantages for privacy-sensitive hospital environments because they can operate on local infrastructure, but careful validation remains essential. The balance between accessibility, accuracy and data protection will shape how language models are integrated into radiology communication workflows.

 

Source: European Radiology



References:

Proff AK, Salam B, Hayawi M et al. (2026) Simplifying radiology reports with large language models: privacy-compliant open- versus closed-weight models. Eur Radiol: In Press.



