Large language models are increasingly used to generate structured medical text, including radiology reports. Lumbar spine MRI reporting follows established formats, making it a suitable context in which to assess automated report generation. A blinded evaluation compared AI-generated reports with radiologist-written reports on report quality and on reviewers' ability to recognise their origin.

 

Comparative Evaluation of Human and AI Radiology Reports

The evaluation compared lumbar spine MRI reports written by experienced radiologists with reports generated by a large language model. The analysis included 125 anonymised reports: 104 human-written reports and 21 reports produced using ChatGPT-4o. All reports followed a standardised institutional reporting format and contained the clinical indication, a description of the imaging technique and a narrative interpretation of imaging findings. Radiologists originally authored the human reports using routine clinical data from lumbar spine MRI examinations performed between January and June 2024.

 

The AI-generated reports used predefined clinical scenarios representing common indications such as low back pain, lumbar disc herniation, degenerative disc disease, spinal canal stenosis, postoperative follow-up, suspected spondylolisthesis and spinal tumours. Prompt inputs included clinical information, imaging techniques and expected findings so that the generated reports followed the same structured template as the human reports. The language model did not access the underlying MRI images and relied solely on the textual prompts provided.
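The prompt construction described above can be sketched as simple template assembly: the clinical scenario fields are slotted into a fixed structure so that generated output mirrors the institutional report template. The field names and template wording below are illustrative assumptions; the study's actual prompts are not reproduced in this article.

```python
# Illustrative sketch of assembling a structured report-generation prompt.
# Field names and template wording are assumptions for illustration only;
# they are not the prompts used in the study.

def build_prompt(indication: str, technique: str, expected_findings: str) -> str:
    """Assemble a structured prompt mirroring an institutional report template."""
    return (
        "Write a lumbar spine MRI report using the sections below.\n\n"
        f"Clinical indication: {indication}\n"
        f"Imaging technique: {technique}\n"
        f"Expected findings: {expected_findings}\n\n"
        "Structure the report as: Indication, Technique, Findings, Conclusion."
    )

prompt = build_prompt(
    indication="Low back pain radiating to the left leg",
    technique="Sagittal and axial T1- and T2-weighted sequences",
    expected_findings="L4-L5 disc herniation with left lateral recess narrowing",
)
```

Because the model never sees the images, everything it can state about findings must already be present in such a prompt, which is consistent with the reviewers' observation that the generated reports were coherent but less specific.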

 

Before evaluation, all reports were anonymised and randomised. Reviewers examined only the textual reports without access to the original imaging studies or additional patient information. Evaluation therefore focused on linguistic quality, clarity and internal coherence rather than image-based diagnostic verification. Reviewers assessed clinical relevance, clarity of technique description, completeness of findings, accuracy of conclusions, intelligibility and adherence to structured reporting standards using a five-point Likert scale.

 


Across these domains, radiologist-written reports consistently achieved higher scores than AI-generated reports. Higher ratings reflected stronger organisation, clearer description of findings and more precise diagnostic interpretation. No statistically significant difference emerged in the description of imaging technique, suggesting that both human and automated reports handled this technical section similarly.

 

Clinician Ability to Distinguish AI and Human Reports

The evaluation also examined whether clinicians could identify the origin of each report. Five medical professionals participated in the blinded assessment: a board-certified radiologist, two radiology residents, one general practitioner and one orthopaedic surgeon. Each reviewer analysed the same anonymised dataset and classified every report as either human-written or AI-generated.

 

Performance varied substantially between professional profiles. The board-certified radiologist correctly identified report origin in most cases and achieved an overall accuracy of 88%. One radiology resident reached the highest accuracy of all reviewers at 94.4%, while the second resident achieved 65.6%. The general practitioner also achieved 65.6% accuracy and frequently misclassified human reports as AI-generated. The orthopaedic surgeon demonstrated balanced sensitivity and specificity, reaching an accuracy of 78.4%.

 

These results highlight the difficulty of distinguishing AI-generated text from human writing, particularly for readers without specialised radiology training. Several AI-generated reports were misidentified as human-written, indicating that automated outputs can reproduce the structure and tone typical of professional radiology documentation.

 

Despite this stylistic similarity, reviewers consistently rated radiologist-written reports as superior. Board-certified radiologists and residents gave significantly higher scores to human reports across clinical relevance, imaging findings, conclusions, intelligibility and adherence to reporting guidelines. AI-generated reports generally showed lower scores for the clarity of clinical information and completeness of findings, although reviewers found their overall structure broadly comparable to standard reporting formats.

 

Clinical Interpretation and Workflow Implications

The evaluation highlights a key distinction between linguistic realism and diagnostic depth. Large language models can generate coherent radiology narratives that resemble professional documentation, yet human reports still demonstrate stronger contextual understanding and more precise clinical interpretation. The automated reports did not contain clinically false statements, but reviewers noted that they sometimes lacked the specificity expected in radiologist reporting.

 

Variability among reviewers suggests that professional experience influences the ability to detect subtle inconsistencies in AI-generated text. The orthopaedic surgeon performed relatively well in distinguishing report origins, possibly reflecting frequent interaction with radiology reports during surgical decision-making. In contrast, the general practitioner often attributed reports to AI, demonstrating the difficulty non-specialists face when evaluating stylistically convincing medical text.

 

Structured reporting practices also shape interpretation. Radiologist-written reports typically follow institutional formatting conventions that align closely with clinical decision-making needs. Such consistency may contribute to clearer communication with referring clinicians, particularly surgeons who rely on imaging reports to support treatment planning.

 

These observations indicate that large language models could assist in drafting structured reports while maintaining human oversight. A workflow in which radiologists input clinical details and imaging findings, followed by automated generation of a structured draft, could reduce documentation time and improve consistency. Radiologists would then review and finalise the report, ensuring diagnostic accuracy and contextual interpretation remain under expert control.
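The supervised workflow described above can be sketched as a two-step process: an automated draft is produced from radiologist-supplied inputs, and it only becomes a final report once a radiologist has reviewed and signed it off, optionally after edits. The class, function and status names here are illustrative assumptions, not a published implementation.

```python
# Minimal sketch of a supervised drafting workflow, assuming a draft/final
# status model. Names and structure are illustrative, not from the study.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Report:
    text: str
    status: str = "draft"  # remains "draft" until a radiologist signs off


def generate_draft(clinical_details: str, findings: str) -> Report:
    # Placeholder for the LLM call: here we only assemble a structured stub.
    text = (
        f"Indication: {clinical_details}\n"
        f"Findings: {findings}\n"
        "Conclusion: [pending radiologist review]"
    )
    return Report(text=text)


def radiologist_finalise(report: Report, revised_text: Optional[str] = None) -> Report:
    # The radiologist may edit the draft; only then is it marked final.
    if revised_text is not None:
        report.text = revised_text
    report.status = "final"
    return report


draft = generate_draft(
    "Suspected spondylolisthesis",
    "Grade 1 anterolisthesis of L5 on S1",
)
final = radiologist_finalise(draft)
```

The design point is simply that no report leaves the workflow in "draft" status, keeping diagnostic responsibility with the radiologist while the model handles first-pass drafting.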

 

Evaluation of lumbar spine MRI reporting shows that radiologist-written reports retain clear advantages in clinical relevance, diagnostic precision and structural quality. AI-generated reports demonstrate coherent language and realistic formatting but remain less detailed and less contextually precise. Clinicians sometimes misidentify automated reports as human-written, particularly when the reader lacks specialised radiology expertise. These findings indicate that large language models can reproduce the style of professional reporting while still requiring expert oversight to ensure clinical reliability. Integration of automated drafting within a supervised workflow could support reporting efficiency and consistency, provided radiologists retain responsibility for final interpretation and report validation.

 

Source: European Radiology Experimental

Image Credit: iStock


References:

Zanardo M, Albano D, Molinari V et al. (2026) Can AI write reports like a radiologist? A blinded evaluation of large language model-generated lumbar spine MRI reports. Eur Radiol Exp, 10:16.



