Radiology reports are fundamental to patient management, with the impression section serving as a concise summary of findings and professional interpretation. These impressions are vital for guiding clinical decisions, yet producing them is often demanding and time-consuming for radiologists. The process requires precision, consistency and the ability to convey key diagnostic information clearly. High workloads increase the risk of inconsistencies and reporting errors, which can affect patient care. 

 

Artificial intelligence offers a potential solution, supporting efficiency and accuracy in radiology workflows. While general-purpose large language models have demonstrated promise, their effectiveness in generating clinical summaries has been questioned. A recent study compared GPT-4o, a widely known general-purpose model, with a model specifically designed to summarise radiology reports, assessing whether a task-focused approach provides greater reliability in clinical practice. 

 

 

Development of a Specialised Model 

The specialised system, named LLM-RadSum, was developed to generate structured impressions from CT and MRI radiology reports. Built on the Llama 2 architecture, it was trained on 956,219 reports from a cancer hospital, with a further 106,247 reports used for internal testing. An additional 17,091 reports from four other hospitals formed the external test set. Only reports containing both findings and impressions were included to ensure completeness and clinical relevance. Follow-up reports, often lacking full impression sections, were excluded. Fine-tuning was applied to tailor the model for radiology reporting. 
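The inclusion criteria described above can be sketched as a simple filter. This is an illustrative sketch only, assuming each report arrives as a dictionary with `findings`, `impression` and `is_follow_up` fields; the field names are assumptions, not taken from the study:

```python
def eligible(report: dict) -> bool:
    """Illustrative inclusion filter mirroring the study's criteria."""
    # Include only reports containing both findings and impression sections
    if not report.get("findings") or not report.get("impression"):
        return False
    # Exclude follow-up reports, which often lack full impression sections
    if report.get("is_follow_up", False):
        return False
    return True

# Example: a complete diagnostic report passes; a follow-up does not
reports = [
    {"findings": "Nodule in right upper lobe.", "impression": "Probable benign nodule.", "is_follow_up": False},
    {"findings": "Interval change.", "impression": "Stable.", "is_follow_up": True},
]
kept = [r for r in reports if eligible(r)]
```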

 

Using supervised learning with an autoregressive objective, the model learned to generate each impression token by token, conditioned on the corresponding findings. Performance was measured with F1 scores based on the longest common subsequence between the generated summary and the reference impression, a metric that balances precision and recall and so reflects both content accuracy and completeness. LLM-RadSum achieved strong internal test results, with a median F1 score of 0.75. On the external dataset, which included variations in reporting styles from multiple hospitals, performance decreased to 0.44. On the human evaluation set, however, the model reached 0.58, demonstrating consistency across diverse conditions and aligning with clinical expectations. 
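The metric can be sketched in a few lines. This is a minimal illustration of a longest-common-subsequence F1 (the same idea underlies the ROUGE-L F-measure), assuming simple whitespace tokenisation; the study's exact tokenisation and implementation are not specified here:

```python
def lcs_length(a: list, b: list) -> int:
    # Dynamic-programming length of the longest common subsequence
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_f1(generated: str, reference: str) -> float:
    # F1 balances precision (vs. generated length) and recall (vs. reference length)
    gen, ref = generated.split(), reference.split()
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(gen)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = lcs_f1("no acute findings", "no acute abnormality")  # LCS of 2 tokens out of 3 each
```

A summary that copies the reference exactly scores 1.0; one sharing no subsequence scores 0.0, so the metric penalises both omitted content (low recall) and padding (low precision).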

 

Comparative Evaluation with GPT-4o 

To benchmark performance against GPT-4o, researchers created a human evaluation set of 1,800 reports, balanced across modalities and anatomical sites. Three senior radiologists and two clinicians independently assessed the outputs, focusing on factual consistency, impression coherence, medical safety and clinical usefulness. 

 

The differences were significant. More than 81.5% of impressions generated by LLM-RadSum were considered suitable for direct use without major edits. In contrast, at least 27.8% of GPT-4o outputs required revision to meet professional standards. Radiologists judged 88.9% of LLM-RadSum’s summaries to be factually consistent with the source reports, compared with only 43.1% for GPT-4o. The specialised model also outperformed GPT-4o on coherence, with 97.8% of its outputs rated as logically structured and concise. Medical safety evaluations revealed further advantages. Around 81.5% of LLM-RadSum’s outputs were safe to sign off directly, whereas most GPT-4o summaries contained minor flaws needing correction. Clinicians confirmed that LLM-RadSum addressed clinical questions in 91.3% of cases, compared with 72.2% for GPT-4o. These findings highlight the ability of a specialised model to produce summaries that closely match both radiologists’ and clinicians’ requirements.

 

Quantitative analysis reinforced these results. LLM-RadSum achieved a median F1 score of 0.58 across the human evaluation set, significantly higher than GPT-4o’s 0.30. The advantage was consistent across modalities, anatomical sites, patient sex and age groups. For MRI reports, the model reached a median score of 0.69 compared with 0.28 for GPT-4o, showing particular strength in this modality. 

 


 

Performance Across Clinical Variables 

The dataset represented a broad patient population, with mean ages in the mid-fifties and balanced gender distribution. CT reports formed the majority of cases, but MRI reports accounted for nearly a fifth, offering robust testing across modalities. Chest examinations were the most common, followed by abdominal and pelvic imaging. LLM-RadSum maintained higher performance across all anatomical sites. Median F1 scores ranged from 0.51 in chest imaging to 0.80 in breast imaging, while GPT-4o consistently scored below 0.35 across categories. Subgroup analysis revealed some variations. Older patients were associated with lower performance for the specialised model, possibly due to the complexity of their medical conditions and longer impression texts. Female patients’ reports showed slightly higher accuracy compared with male patients. 

 

Imaging modality also influenced outcomes. MRI was associated with a greater likelihood of higher performance compared with CT. In terms of anatomical regions, the specialised model performed particularly well in head and neck imaging, while breast and abdominal regions showed comparatively lower results. Despite these variations, LLM-RadSum consistently outperformed GPT-4o in every subgroup, confirming the benefits of task-specific training. Example cases demonstrated that the specialised model produced structured, clinically aligned impressions, while GPT-4o tended to introduce redundancy or over-diagnose negative findings. 

 

The comparison highlights the value of specialised large language models for clinical tasks. LLM-RadSum delivered superior factual accuracy, coherence and safety compared with a widely used general-purpose model. By closely replicating radiologists’ reporting style and addressing clinical needs more effectively, it has the potential to support efficiency and reliability in radiology workflows. Challenges remain, particularly around generalisability across institutions and performance in complex cases, but the findings indicate that tailored models offer clear advantages in healthcare applications. For radiologists and healthcare decision-makers, adopting domain-specific AI may provide a practical pathway to reducing reporting burdens while maintaining high standards of diagnostic communication. 

 

Source: Radiology 

Image Credit: iStock


References:

Zheng S, Zhao N, Wang J et al (2025) Comparison of a Specialized Large Language Model with GPT-4o for CT and MRI Radiology Report Summarization. Radiology, 316(2):e243774. 


