The Ovarian-Adnexal Reporting and Data System (O-RADS) for MRI provides a structured method for stratifying malignancy risk in adnexal masses. Endorsed by the American College of Radiology, it categorises findings into five levels of risk, aiming to enhance decision-making and communication between radiologists and referring clinicians. Despite its value, adoption in routine clinical practice remains inconsistent, as radiologists vary in how they interpret and apply O-RADS rules, and existing calculators require manual interaction, which can be inefficient.
With recent advances in natural language processing, particularly through large language models (LLMs), there is an opportunity to automate the calculation of O-RADS MRI scores directly from radiology report descriptions. A recent study published in Radiology examined the performance of an optimised LLM-based approach, including a hybrid strategy combining AI-driven feature classification with deterministic scoring logic, in assigning O-RADS scores from pelvic MRI reports.
The Promise of Hybrid Automation in Radiology
In this retrospective single-centre study, two LLM-based strategies were evaluated using pelvic MRI reports from a regional cancer centre. All reports described at least one nonphysiologic adnexal lesion and were drawn from two time periods: before and after the implementation of the O-RADS MRI system. The first approach, termed "LLM only", employed GPT-4 with few-shot learning and direct prompting with O-RADS rules to assign scores. The second, referred to as the "hybrid" model, used GPT-4 to classify lesion features extracted from report text, which were then processed by a deterministic formula to generate the final O-RADS score.
The hybrid model demonstrated superior accuracy across internal test sets. In the set of 173 lesions where radiologists had previously assigned O-RADS scores, the hybrid model achieved 97% accuracy, outperforming the LLM-only model at 90%. For lesions where an O-RADS score had originally been reported (n = 158), the hybrid model again surpassed both the LLM-only model and the original radiologist, with an accuracy of 97%, compared to 89% and 88%, respectively. The hybrid model performed particularly well in categorising O-RADS 2 lesions, correctly classifying 98% of such cases, in contrast to the radiologist’s 86%. Additionally, the model maintained high accuracy when tested on 183 lesions from earlier reports predating the implementation of O-RADS MRI, achieving 95% compared to 87% for the LLM-only strategy. These findings support the hybrid model's robustness in handling diverse report styles and its potential for retrospective or prospective clinical use.
Strategic Optimisation: The Key to Model Performance
The study highlighted the necessity of strategic optimisation for LLM applications in radiology. Out-of-the-box models often struggle with the complexity of clinical scoring systems. To address this, the hybrid approach utilised prompt engineering and structured classification of features such as lesion size, location, enhancement characteristics and tissue composition. GPT-4 was used to produce structured outputs in JSON format, which were subsequently fed into a deterministic scoring algorithm based on O-RADS MRI rules.
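The second, deterministic stage of such a pipeline can be sketched as a plain rule-based function over the LLM's JSON output. The sketch below is a heavily simplified illustration: the feature names and the handful of rules shown are assumptions for demonstration, not the published algorithm or the full O-RADS MRI lexicon.

```python
import json

def assign_orads(features: dict) -> int:
    """Assign a simplified O-RADS MRI score from structured lesion features.

    NOTE: illustrative only. The real O-RADS MRI decision rules are far more
    detailed; the feature keys used here are hypothetical.
    """
    if features.get("no_lesion", False):
        return 1  # normal ovaries, no lesion described
    if features.get("peritoneal_implants", False):
        return 5  # peritoneal disease implies the highest risk category
    if features.get("solid_enhancing_tissue", False):
        # Risk of solid tissue depends on its enhancement (time-intensity) curve
        tic = features.get("time_intensity_curve")  # "low" / "intermediate" / "high"
        return {"low": 3, "intermediate": 4, "high": 5}.get(tic, 4)
    if features.get("simple_fluid_unilocular", False):
        return 2  # unilocular simple-fluid cyst: almost certainly benign
    return 3  # other cystic lesions default to the low-risk category here

# Example structured output, as might be produced by the LLM feature stage
llm_json = json.dumps({"solid_enhancing_tissue": True,
                       "time_intensity_curve": "high"})
print(assign_orads(json.loads(llm_json)))  # -> 5
```

Keeping the scoring logic in deterministic code, as the study's hybrid strategy does, means the LLM only has to solve the narrower feature-extraction task, and every final score can be traced back to explicit rules.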
This strategy allowed the model to excel where standard LLMs fall short. The hybrid model demonstrated almost perfect agreement with the expert-reviewed reference standard (κ = 0.95), compared with strong agreement for the original radiologist scores (κ = 0.91) and the LLM-only model (κ = 0.87). Discrepancies in hybrid model outputs were rare: of the 158 lesions with original scores, the hybrid model differed in 21 cases, and in 81% of those it matched the reference standard. The model's output also offered interpretability by revealing which specific features were used to determine each score. The most common causes of discrepancies were misclassification of features such as solid enhancing tissue and fluid content. Even with these limitations, the model demonstrated consistent performance across datasets, including reports using less standardised language from the period prior to O-RADS MRI implementation.
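The κ values reported above are Cohen's kappa, a chance-corrected measure of agreement between two raters. As a minimal sketch (the score lists below are invented, not the study's data), it can be computed as:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Proportion of cases on which the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical O-RADS scores from a model and a reference standard
model_scores = [2, 2, 3, 4, 5, 2]
reference    = [2, 2, 3, 4, 5, 3]
print(round(cohens_kappa(model_scores, reference), 2))  # -> 0.77
```

Values above roughly 0.8 are conventionally described as "almost perfect" agreement, which is why the hybrid model's κ = 0.95 is a notably strong result.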
Clinical Implications and Future Integration
The hybrid model’s accuracy in classifying O-RADS 2 lesions suggests its utility in reducing unnecessary interventions. Correct identification of benign findings could prevent overtreatment or redundant imaging. Importantly, the study found that only a minority of MRI reports originally included an O-RADS score, underscoring the opportunity for automated tools to increase adoption. The hybrid model showed consistent performance even in reports lacking standard O-RADS terminology, suggesting its potential as a retrospective auditing or prospective decision support tool.
Integration of such models into dictation software or clinical systems could support radiologists by automating part of the scoring process and aligning reporting with expert standards. While external validation and broader generalisability remain necessary, this study demonstrates a clear path forward for implementing LLM-based applications that enhance diagnostic consistency. The transparency of the hybrid model further strengthens its appeal by enabling clinicians to understand the rationale behind the automated scores. Moreover, its strong performance across different lesion types and reporting periods indicates its potential resilience to regional variations in practice.
The study demonstrated that large language models, when strategically optimised, can accurately assign O-RADS MRI scores from pelvic MRI reports. The hybrid approach, combining LLM feature classification with a deterministic formula, consistently outperformed both a simpler LLM-only model and the original radiologist assessments. With high agreement with the reference standard and adaptability across reporting styles, this hybrid method may support greater consistency and adoption of O-RADS MRI in clinical practice. Further evaluation in diverse clinical settings and application to other reporting systems would be a valuable next step.
Source: Radiology
Image Credit: Freepik