Large language models (LLMs) are becoming increasingly present in clinical documentation workflows, including interpretation of radiology reports and structured data extraction. Their ability to process narrative medical text raises important questions about reliability when numerical interpretation and clinical thresholds are involved. Radiology reports often combine measurements, comparisons and clinical context in unstructured language, requiring both numerical reasoning and domain understanding. An evaluation of several open-weight and proprietary LLMs examined performance across multiple radiology report tasks involving measurement extraction and clinically relevant numerical interpretation. The results highlight strengths in structured extraction tasks while also revealing areas where domain-specific complexity continues to challenge model performance.

 

Numerical Extraction from Radiology Reports

Several tasks focused on extracting quantitative information from radiology report text. These included identifying the largest lung nodule measurement described in computed tomography reports, extracting common bile duct diameter values from ultrasound reports and identifying the current minimum T-score from dual-energy X-ray absorptiometry reports. Each task required selecting the correct measurement among multiple values reported in clinical narrative form.

 

Across these extraction tasks, model performance was generally strong. Most reasoning-oriented models achieved accuracy levels above approximately 95%, with the highest-performing model approaching complete accuracy in some extraction scenarios. A non-reasoning baseline model showed more variation, particularly in ultrasound measurement extraction, where accuracy was noticeably lower than that of the other models. Extraction of T-score values showed consistently high performance across models, with only minor variation between systems.

 

The reports used for evaluation were drawn from both an institutional radiology report database and the MIMIC-III dataset. Ground truth values were manually extracted by independent reviewers and discrepancies were resolved through consensus. Evaluation focused on whether the extracted value matched the reference answer, regardless of additional explanatory text produced by the models.
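Scoring an answer "regardless of additional explanatory text" can be approximated by value-level matching. The helper below is a hedged sketch of that idea, not the study's exact protocol: it accepts a response if the reference number appears anywhere in the output, within a tolerance.

```python
import re

def value_matches(model_output: str, reference: float, tol: float = 1e-6) -> bool:
    """Check whether the reference value appears in the model's output,
    ignoring surrounding explanation. Illustrative scoring sketch only."""
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", model_output)]
    return any(abs(n - reference) <= tol for n in numbers)

print(value_matches("The largest nodule measures 11 mm in the left lower lobe.", 11.0))  # True
```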

 

Clinical Judgement Based on Numerical Criteria

Additional tasks required interpreting extracted values using clinical thresholds. These included identifying highly hypermetabolic findings in positron emission tomography reports based on SUVMax thresholds greater than 5, determining osteoporosis status using T-score criteria that incorporated age considerations and identifying bile duct dilation using age-adjusted diameter thresholds beginning at 6 mm for younger adults and increasing with age.
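The threshold logic above can be written as simple predicates. The sketch below fills in details the article does not specify: the −2.5 T-score cut-off follows the widely used WHO convention, and the +1 mm-per-decade bile duct schedule beyond age 60 is an assumed illustration of "increasing with age", not the study's exact rule.

```python
def is_highly_hypermetabolic(suv_max: float) -> bool:
    # Article's stated criterion: SUVMax greater than 5.
    return suv_max > 5.0

def osteoporosis_by_tscore(t_score: float) -> bool:
    # WHO convention: T-score of -2.5 or below; the study's additional
    # age considerations are not detailed in this summary.
    return t_score <= -2.5

def cbd_threshold_mm(age: int) -> float:
    # The article states only that the cut-off begins at 6 mm for younger
    # adults and rises with age; +1 mm per completed decade beyond 60 is
    # an assumed schedule used here for illustration.
    return 6.0 + max(0, (age - 60) // 10)

def bile_duct_dilated(diameter_mm: float, age: int) -> bool:
    return diameter_mm > cbd_threshold_mm(age)
```

Even with criteria this simple, the judgement tasks require the model to first extract the right value from narrative text before applying the rule, which is where the reported errors concentrated.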

 

Performance differences between models became more visible in these judgement tasks. Reasoning-oriented models showed consistently high accuracy, including near-perfect results in osteoporosis classification tasks and strong performance in bile duct dilation assessment. Identification of hypermetabolic findings showed the widest variation across models. The baseline model demonstrated substantially lower accuracy in that task, largely due to incorrect assumptions about reported SUVMax values.

 

When impression sections were removed from reports during evaluation, models relied entirely on descriptive findings and measurements within the body of the report. This condition tested whether models could correctly interpret numerical criteria without summarised clinical conclusions. Reasoning models remained stable under these conditions, while the baseline model showed greater variability.

 

Effects of Output Constraints and Error Patterns

A secondary experiment evaluated model performance under strict formatting requirements that allowed only the final numerical answer without additional explanation. Under these constraints, performance differences became more pronounced. The baseline model and the distilled reasoning model showed clear declines in extraction accuracy, particularly for T-score and bile duct measurement tasks. In contrast, reinforcement learning–trained reasoning models maintained high accuracy even when required to produce answer-only outputs.

 

Manual review of incorrect responses revealed recurring error patterns linked more to medical text interpretation than to mathematical operations. In the most advanced reasoning models, no apparent calculation or comparison errors were identified. Instead, mistakes were associated with report structure and terminology. Examples included confusion between anatomical structures, selection of incorrect measurements when multiple numbers were present and misinterpretation of image or series identifiers as clinical values.

 

In the hypermetabolic detection task, one model frequently inferred SUVMax values that were not present in the report text. Other observed issues included missing relevant measurements, incomplete responses and incorrect application of task criteria. These findings suggest that domain-specific formatting and terminology remain important sources of error even when numerical reasoning is stable.

 

The presence or absence of explicit target values within a report affected performance primarily in the baseline model. Reasoning-oriented systems showed less sensitivity to this factor, suggesting stronger robustness when relevant measurements were embedded within complex report narratives.

 

Performance across radiology numerical reasoning tasks demonstrated that reasoning-oriented LLMs can achieve high accuracy in both measurement extraction and clinical interpretation scenarios. Extraction tasks produced consistently strong results across most models, while judgement tasks highlighted clearer differences in reliability. Strict output-format requirements exposed additional performance gaps in non-reasoning models but had minimal effect on reinforcement learning–trained reasoning systems. Remaining errors were linked mainly to radiology report conventions and medical terminology rather than numerical calculation itself. These findings emphasise the importance of aligning task design, report structure and model capabilities when applying LLMs to radiology reporting workflows.

 

Source: Journal of Imaging Informatics in Medicine

Image Credit: iStock


References:

Nowroozi A, Bondarenko M, Serapio A et al. (2026) Large Language Models in Radiologic Numerical Tasks: A Thorough Evaluation and Error Analysis. J Imaging Inform Med: In Press.



