Rapid identification of critical radiology findings is essential as reporting volumes rise and language grows more nuanced. Traditional natural language processing depends on rules or large annotated datasets, which can miss context. General-purpose large language models (LLMs) offer a prompt-driven alternative that may work without retraining. Recent work assessed whether out-of-the-box LLMs can detect and categorise critical findings from radiology reports across modalities, comparing zero-shot, static few-shot and dynamic few-shot prompting. The evaluation focused on true, known or expected, and equivocal categories, reporting precision and recall against manual reference assessments and exploring practical implications and constraints.
Cohorts, Curation and Categories
A list of 102 critical findings, aligned with an actionable reporting framework, was expanded via a medical terminology resource to 210 terms. From 522,279 MIMIC-III reports, 440,537 contained an impression or conclusion. Regular expression screening using the 210 terms yielded 2715 reports, reduced to 2543 unique reports after deduplication. Manual review identified findings beyond verbatim term matches and assigned each report to one of four types: at least one true critical finding; at least one known or expected critical finding without true findings; at least one equivocal finding without true or known or expected findings; or no critical findings.
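As a rough illustration of this screening step, the sketch below filters impression text against a term list using a single compiled regular expression; the terms, data frame and column names are placeholders, since the article does not provide the actual term list or data format.

```python
import re
import pandas as pd

# Hypothetical subset of the expanded 210-term list; the full list is not given in the article.
CRITICAL_TERMS = ["pneumothorax", "pulmonary embolism", "free air", "acute haemorrhage"]

# One case-insensitive pattern with word boundaries around each term.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in CRITICAL_TERMS) + r")\b",
    flags=re.IGNORECASE,
)

def screen_reports(reports: pd.DataFrame) -> pd.DataFrame:
    """Keep reports whose impression/conclusion section matches any critical term."""
    has_impression = reports["impression"].notna()
    matches = reports.loc[has_impression, "impression"].str.contains(PATTERN)
    screened = reports.loc[has_impression].loc[matches]
    # Deduplicate identical report texts, mirroring the reduction to unique reports.
    return screened.drop_duplicates(subset="impression")
```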
A stratified primary sample of 252 reports spanned CT (55.6%), radiography (29.4%), MRI (9.1%), ultrasound (3.9%), nuclear medicine (1.6%) and angiography (0.3%) by modality, with head (22.2%), neck (5.1%), chest (35.3%), abdomen and pelvis (29.8%) and other (7.5%) by body region. Across the 252 reports, 229 critical findings were recorded: 143 true, 34 known or expected and 52 equivocal. A 50-report tuning set supported prompt engineering, a 125-report holdout set underpinned testing, and 77 reports served as few-shot examples. Interobserver categorisation agreement in the holdout set was measured between a radiology resident and an experienced radiologist, with consensus used for analysis. An external test set of 180 chest radiography reports from CheXpert Plus included 92 critical findings: 58 true, 14 known or expected and 20 equivocal. Patient age and sex were not captured, and images were not reviewed.
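The article reports that two-reader agreement was measured but does not name the statistic; Cohen's kappa is one common choice for categorical agreement between two readers, and the sketch below assumes that choice purely for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-report category labels from each reader (four-type scheme); not study data.
resident = ["true", "equivocal", "none", "known_or_expected", "true"]
radiologist = ["true", "true", "none", "known_or_expected", "true"]

# Cohen's kappa adjusts raw percentage agreement for agreement expected by chance.
kappa = cohen_kappa_score(resident, radiologist)
print(f"kappa={kappa:.2f}")
```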
Prompting Strategies and Model Results
Two models were evaluated: GPT-4 at temperature 0.0 and locally run Mistral-7B. Prompts requested structured lists of findings within the three categories, with optional explanations. On the tuning set, zero-shot, static few-shot and dynamic few-shot strategies were compared. Static few-shot used one to five fixed labelled examples. Dynamic few-shot selected five semantically similar examples per case using embeddings and a k-nearest neighbour approach. For true finding detection, automated metrics improved with more static examples: BLEU-1 rose from 0.691 with zero-shot to 0.778 with five static examples, ROUGE-F1 from 0.706 to 0.797 and G-Eval from 0.428 to 0.573. Dynamic few-shot gave intermediate values for true findings and modest advantages for known or expected findings. Equivocal findings showed lower absolute scores across strategies. The final prompt for testing used a static five-example configuration with explanations to prioritise true finding detection.
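The dynamic few-shot step can be pictured as embedding-based retrieval of labelled examples; the sketch below assumes a sentence-transformers encoder and scikit-learn's NearestNeighbors, neither of which is specified in the article.

```python
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# Assumed embedding model; the article does not state which embeddings were used.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_example_index(example_reports: list[str]) -> NearestNeighbors:
    """Embed the labelled example pool once and fit a k-NN index over it."""
    embeddings = encoder.encode(example_reports)
    return NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)

def select_dynamic_examples(index: NearestNeighbors,
                            example_reports: list[str],
                            query_report: str) -> list[str]:
    """Return the five labelled examples most similar to the report being classified."""
    query_embedding = encoder.encode([query_report])
    _, neighbour_ids = index.kneighbors(query_embedding)
    return [example_reports[i] for i in neighbour_ids[0]]
```

The selected examples would then be placed in the prompt ahead of the report under evaluation, in place of the fixed examples used by the static strategy.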
In the holdout set, two reports triggered a content filter, leaving 123 for evaluation. GPT-4 achieved precision and recall of 90.1% and 86.9% for true findings, 80.9% and 85.0% for known or expected findings and 80.5% and 94.3% for equivocal findings. Mistral-7B recorded 75.6% and 77.4% for true findings, 34.1% and 70.0% for known or expected findings and 41.3% and 74.3% for equivocal findings. In the external chest radiography set, GPT-4 reached 82.6% precision and 98.3% recall for true findings, 76.9% and 71.4% for known or expected findings and 70.8% and 85.0% for equivocal findings. Mistral-7B showed 75.0% and 93.1% for true findings, 33.3% and 92.9% for known or expected findings and 34.0% and 80.0% for equivocal findings.
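For readers who want to reproduce the metric definitions, finding-level precision and recall of the kind quoted above can be computed as in the sketch below; it simplifies matching to exact strings, whereas the study compared model output against manual reference assessments.

```python
def precision_recall(predicted: list[str], reference: list[str]) -> tuple[float, float]:
    """Finding-level precision and recall for one category, using exact-string matches."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall

# Toy illustration only; these findings are not taken from the study data.
p, r = precision_recall(
    predicted=["pneumothorax", "rib fracture"],
    reference=["pneumothorax", "pulmonary embolism"],
)
print(f"precision={p:.1%}, recall={r:.1%}")  # precision=50.0%, recall=50.0%
```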
Practical Utility and Boundaries
Results indicate that general-purpose LLMs can detect and classify multiple critical findings per report through in-context learning, with static few-shot prompting supporting the strongest performance for true findings and dynamic selection offering some benefit in specific categories. Potential use cases include surfacing true critical findings in finalised reports or identifying omissions before sign-off. Differences between models likely reflect prompt tuning with GPT-4, model scale and instruction-following behaviour. Several factors limit interpretation: modest dataset sizes; a CT-heavy holdout set and a chest radiography-only external set; fewer known or expected and equivocal examples; reliance on automated text similarity metrics that may not capture domain phrasing; and an interobserver analysis limited to categorisation. Clinical context beyond report text was unavailable, and the retrospective design did not evaluate downstream impact.
Out-of-the-box LLMs guided by static few-shot prompts can support detection and categorisation of critical findings in radiology reports, particularly for true findings, while performance for known or expected and equivocal findings varies by strategy and model. The approach suggests a feasible path to augment safety checks and reporting workflows through prompt engineering rather than retraining, with further work needed on larger, more diverse datasets, incorporation of clinical context and prospective assessment before routine clinical adoption.
Source: American Journal of Roentgenology
Image Credit: Freepik