Accurate and timely interpretation of radiology reports is critical for monitoring cancer progression and informing treatment decisions. However, the free-text nature and variability of these reports often pose challenges for oncologists, especially when dealing with large volumes of patient data. Recent advancements in artificial intelligence, particularly large language models (LLMs), offer a promising solution to structure and interpret these complex reports. A recent comparative study evaluated the performance of GPT-4 and Gemini, two leading LLMs, in analysing abdominal CT scan reports of cancer patients to identify oncological issues that require clinical attention.
Comparative Evaluation of GPT-4 and Gemini
The study retrospectively analysed data from 205 patients, each with two consecutive abdominal CT scan reports. Both GPT-4 and Gemini were prompted with a structured three-step task: to match findings between the two reports, to extract the tumour-related findings relevant to chemotherapy decisions and to classify the progression of those findings. In the first task of matching findings, GPT-4 achieved significantly higher accuracy, correctly matching 96.2% of findings compared with 91.7% for Gemini. GPT-4 had no omissions and fewer mismatched or fabricated correspondences.
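The exact prompt wording used in the study is not reproduced in the article, so the sketch below is only an illustration of how such a three-step instruction might be assembled for a chat-style model. All wording is hypothetical, and call_llm() is a placeholder rather than a real library function.

```python
# Illustrative sketch of the three-step prompting workflow described above.
# The study's actual prompts are not public; the wording here is hypothetical,
# and call_llm() is a placeholder for whichever chat-style model API is used
# (e.g. GPT-4 or Gemini), not a real library call.

def build_prompt(prior_report: str, current_report: str) -> str:
    """Assemble a single structured instruction covering the three tasks."""
    return (
        "You are assisting an oncologist reviewing two consecutive abdominal CT reports.\n"
        "Step 1: Match each finding in the current report to the corresponding "
        "finding in the prior report.\n"
        "Step 2: From the matched findings, extract only tumour-related findings "
        "relevant to chemotherapy decisions.\n"
        "Step 3: For each extracted finding, classify its progression as "
        "improved, stable or aggravated.\n\n"
        f"Prior report:\n{prior_report}\n\n"
        f"Current report:\n{current_report}\n"
    )

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: send the prompt to the chosen model and return its reply."""
    raise NotImplementedError("Connect this to the model API of your choice.")

# Usage (illustrative):
# analysis = call_llm(build_prompt(prior_report_text, current_report_text))
```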
In identifying oncological issues, GPT-4 again outperformed Gemini. Excluding findings labelled as “no tumour description” and “other malignancy,” the precision, recall and F1 scores of GPT-4 were higher than those of Gemini. GPT-4 recorded a precision of 0.68, a recall of 0.91 and an F1 score of 0.78. In contrast, Gemini showed a precision of 0.63, a recall of 0.78 and an F1 score of 0.70. These results indicate GPT-4’s stronger ability to identify relevant tumour-related findings with fewer missed detections. In determining tumour status across 306 relevant findings, GPT-4 correctly assigned the status in 87.6% of cases compared to 73% for Gemini, showing superior performance, particularly in identifying stable disease.
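For readers less familiar with these metrics, the F1 score is simply the harmonic mean of precision and recall. The short check below uses only the precision and recall figures reported above (the underlying true- and false-positive counts are not given in the article) and reproduces the published F1 values.

```python
# Sanity check: F1 is the harmonic mean of precision and recall.
# Only the precision and recall values reported above are used here.

def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f"GPT-4:  {f1_score(0.68, 0.91):.2f}")  # 0.78
print(f"Gemini: {f1_score(0.63, 0.78):.2f}")  # 0.70
```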
Understanding Errors and Model Limitations
Despite GPT-4's overall advantage, both models exhibited notable limitations. One key issue was the high number of false positives, in which benign conditions were misclassified as oncological issues. These errors most often arose from findings described in medical terminology that did not specify whether the abnormality was benign or malignant; this lack of specificity led both models to frequent misclassification. For GPT-4, 50% of false positives fell into this category, and Gemini showed a similar proportion. In addition, GPT-4 sometimes excluded tumour-related findings for which only a low probability of malignancy was described, prioritising the benign interpretation over the malignant one.
False negative findings were also a concern. GPT-4 produced 28 false negatives, whereas Gemini produced 67. The most significant problem for GPT-4 was its tendency to overlook malignancies when benign features were also present in the same finding. For Gemini, a larger issue was the misinterpretation of findings that were clearly described as malignant. In such cases, Gemini often incorrectly classified them as benign, which significantly lowered its recall performance. Furthermore, Gemini incorrectly categorised many tumour findings with stable progression as benign, which impacted the accuracy of its status assignments. While GPT-4 performed better across most tumour status categories, its performance also declined in recognising stable conditions compared to improved or aggravated ones.
Clinical Implications and Future Directions
The results of this study indicate the potential of large language models, particularly GPT-4, in assisting oncologists with the analysis of serial CT scan reports. By effectively identifying tumour-related findings and changes in disease status, GPT-4 can help reduce the cognitive load on clinicians and enhance the efficiency of decision-making. This capability is particularly valuable in oncology, where large volumes of imaging data and clinical results must be synthesised during patient consultations.
Although both models completed their analysis in under 30 seconds per report pair—an acceptable timeframe in clinical settings—there is still a need for improvement in precision. Many of the inaccuracies stemmed from misclassification of medical terminology. Addressing this issue may involve refining the language models through additional training on medical datasets and improving the structure of prompts used during analysis. These enhancements could further improve the models’ accuracy and their reliability in clinical environments.
Additionally, the study revealed certain limitations in its design, including the use of English-language reports written by non-native speakers, which may introduce linguistic inconsistencies. The retrospective nature of the study also limits its generalisability to real-world clinical practice. Future research should explore the use of LLMs in multilingual and prospective clinical environments to evaluate their performance more comprehensively.
The study provides strong evidence of the potential benefits of LLM-assisted analysis in cancer imaging. GPT-4 consistently outperformed Gemini in matching findings, identifying relevant oncological data and determining tumour status. While both models demonstrated rapid analysis times, GPT-4's superior accuracy makes it the stronger candidate for eventual integration into clinical workflows. Nonetheless, improvements in precision, model training and prompt design are needed to address current limitations. With further development, LLMs such as GPT-4 could become integral tools in enhancing oncological surveillance and supporting cancer care delivery.
Source: Academic Radiology