Rapid identification of critical radiology findings is essential as reporting volumes rise and language grows more nuanced. Traditional natural language processing depends on rules or large annotated datasets, which can miss context. General-purpose large language models (LLMs) offer a prompt-driven alternative that may work without retraining. Recent work assessed whether out-of-the-box LLMs can detect and categorise critical findings from radiology reports across modalities, comparing zero-shot, static few-shot and dynamic few-shot prompting. The evaluation focused on true, known or expected, and equivocal categories, reporting precision and recall against manual reference assessments and exploring practical implications and constraints.
Cohorts, Curation and Categories
A list of 102 critical findings, aligned with an actionable reporting framework, was expanded via a medical terminology resource to 210 terms. From 522,279 MIMIC-III reports, 440,537 contained an impression or conclusion. Regular expression screening using the 210 terms yielded 2715 reports, reduced to 2543 unique reports after deduplication. Manual review identified findings beyond verbatim term matches and assigned each report to one of four types: at least one true critical finding; at least one known or expected critical finding without true findings; at least one equivocal finding without true or known or expected findings; or no critical findings.
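As a rough illustration of this screening step, the sketch below filters impression text against a term list using a single compiled regular expression; the terms, data frame and column names are placeholders, since the article does not provide the actual term list or data format.

```python
import re
import pandas as pd

# Hypothetical subset of the expanded 210-term list; the full list is not given in the article.
CRITICAL_TERMS = ["pneumothorax", "pulmonary embolism", "free air", "acute haemorrhage"]

# One case-insensitive pattern with word boundaries around each term.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in CRITICAL_TERMS) + r")\b",
    flags=re.IGNORECASE,
)

def screen_reports(reports: pd.DataFrame) -> pd.DataFrame:
    """Keep reports whose impression/conclusion section matches any critical term."""
    has_impression = reports["impression"].notna()
    matches = reports.loc[has_impression, "impression"].str.contains(PATTERN)
    screened = reports.loc[has_impression].loc[matches]
    # Deduplicate identical report texts, mirroring the reduction to unique reports.
    return screened.drop_duplicates(subset="impression")
```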
A stratified primary sample of 252 reports spanned CT (55.6%), radiography (29.4%), MRI (9.1%), ultrasound (3.9%), nuclear medicine (1.6%) and angiography (0.3%) by modality, with head (22.2%), neck (5.1%), chest (35.3%), abdomen and pelvis (29.8%) and other (7.5%) by body region. Across the 252 reports, 229 critical findings were recorded: 143 true, 34 known or expected and 52 equivocal. A 50-report tuning set supported prompt engineering, a 125-report holdout set underpinned testing, and 77 reports served as few-shot examples. Interobserver categorisation agreement in the holdout set was measured between a radiology resident and an experienced radiologist, with consensus used for analysis. An external test set of 180 chest radiography reports from CheXpert Plus included 92 critical findings: 58 true, 14 known or expected and 20 equivocal. Patient age and sex were not captured, and images were not reviewed.
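The article reports that two-reader agreement was measured but does not name the statistic; Cohen's kappa is one common choice for categorical agreement between two readers, and the sketch below assumes that choice purely for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-report category labels from each reader (four-type scheme); not study data.
resident = ["true", "equivocal", "none", "known_or_expected", "true"]
radiologist = ["true", "true", "none", "known_or_expected", "true"]

# Cohen's kappa adjusts raw percentage agreement for agreement expected by chance.
kappa = cohen_kappa_score(resident, radiologist)
print(f"kappa={kappa:.2f}")
```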
Prompting Strategies and Model Results
Two models were evaluated: GPT-4 at temperature 0.0 and locally run Mistral-7B. Prompts requested structured lists of findings within the three categories, with optional explanations. On the tuning set, zero-shot, static few-shot and dynamic few-shot strategies were compared. Static few-shot used one to five fixed labelled examples. Dynamic few-shot selected five semantically similar examples per case using embeddings and a k-nearest neighbour approach. For true finding detection, automated metrics improved with more static examples: BLEU-1 rose from 0.691 with zero-shot to 0.778 with five static examples, ROUGE-F1 from 0.706 to 0.797 and G-Eval from 0.428 to 0.573. Dynamic few-shot gave intermediate values for true findings and modest advantages for known or expected findings. Equivocal findings showed lower absolute scores across strategies. The final prompt for testing used a static five-example configuration with explanations to prioritise true finding detection.
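The dynamic few-shot step can be pictured as embedding-based retrieval of labelled examples; the sketch below assumes a sentence-transformers encoder and scikit-learn's NearestNeighbors, neither of which is specified in the article.

```python
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# Assumed embedding model; the article does not state which embeddings were used.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_example_index(example_reports: list[str]) -> NearestNeighbors:
    """Embed the labelled example pool once and fit a k-NN index over it."""
    embeddings = encoder.encode(example_reports)
    return NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)

def select_dynamic_examples(index: NearestNeighbors,
                            example_reports: list[str],
                            query_report: str) -> list[str]:
    """Return the five labelled examples most similar to the report being classified."""
    query_embedding = encoder.encode([query_report])
    _, neighbour_ids = index.kneighbors(query_embedding)
    return [example_reports[i] for i in neighbour_ids[0]]
```

The selected examples would then be placed in the prompt ahead of the report under evaluation, in place of the fixed examples used by the static strategy.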
In the holdout set, two reports triggered a content filter, leaving 123 for evaluation. GPT-4 achieved precision and recall of 90.1% and 86.9% for true findings, 80.9% and 85.0% for known or expected findings and 80.5% and 94.3% for equivocal findings. Mistral-7B recorded 75.6% and 77.4% for true findings, 34.1% and 70.0% for known or expected findings and 41.3% and 74.3% for equivocal findings. In the external chest radiography set, GPT-4 reached 82.6% precision and 98.3% recall for true findings, 76.9% and 71.4% for known or expected findings and 70.8% and 85.0% for equivocal findings. Mistral-7B showed 75.0% and 93.1% for true findings, 33.3% and 92.9% for known or expected findings and 34.0% and 80.0% for equivocal findings.
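For readers who want to reproduce the metric definitions, finding-level precision and recall of the kind quoted above can be computed as in the sketch below; it simplifies matching to exact strings, whereas the study compared model output against manual reference assessments.

```python
def precision_recall(predicted: list[str], reference: list[str]) -> tuple[float, float]:
    """Finding-level precision and recall for one category, using exact-string matches."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall

# Toy illustration only; these findings are not taken from the study data.
p, r = precision_recall(
    predicted=["pneumothorax", "rib fracture"],
    reference=["pneumothorax", "pulmonary embolism"],
)
print(f"precision={p:.1%}, recall={r:.1%}")  # precision=50.0%, recall=50.0%
```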
Practical Utility and Boundaries
Results indicate that general-purpose LLMs can detect and classify multiple critical findings per report through in-context learning, with static few-shot prompting supporting the strongest performance for true findings and dynamic selection offering some benefit in specific categories. Potential use cases include surfacing true critical findings in finalised reports or identifying omissions before sign-off. Differences between models likely reflect prompt tuning with GPT-4, model scale and instruction-following behaviour. Several factors limit interpretation: modest dataset sizes; a CT-heavy holdout set and a chest radiography-only external set; fewer known or expected and equivocal examples; reliance on automated text similarity metrics that may not capture domain phrasing; and an interobserver analysis limited to categorisation. Clinical context beyond report text was unavailable, and the retrospective design did not evaluate downstream impact.
Out-of-the-box LLMs guided by static few-shot prompts can support detection and categorisation of critical findings in radiology reports, particularly for true findings, while performance for known or expected and equivocal findings varies by strategy and model. The approach suggests a feasible path to augment safety checks and reporting workflows through prompt engineering rather than retraining, with further work needed on larger, more diverse datasets, incorporation of clinical context and prospective assessment before routine clinical adoption.
Source: American Journal of Roentgenology
Image Credit: Freepik