Automating the conversion of routine radiology reports into structured training data remains a bottleneck for imaging AI. An evaluation of a large language model (LLM) applied to upper extremity radiography examined whether zero-shot label extraction from free-text reports could provide accurate, uncertainty-aware annotations for multi-label image classification. Radiography series of the clavicle (n=1170), elbow (n=3755) and thumb (n=1978) were processed after anonymisation, with labels assigned as present, absent or uncertain. Extracted labels then trained convolutional neural networks (CNNs) for each anatomical region. Label extraction accuracy was verified on internal and external test sets, and the influence of handling uncertainty during training was assessed. The approach sought to accelerate dataset creation while maintaining diagnostic performance and generalisability across sites.
Two-Centre Pipeline from Reports to Training Labels
A retrospective, two-centre design combined radiography and corresponding reports from a university hospital for internal training, validation and testing, and an external hospital for independent testing. Reports were in German and anonymised before processing. OpenAI’s GPT-4o operated in zero-shot mode to complete predefined JavaScript Object Notation (JSON) templates for each region, designed by a senior musculoskeletal radiologist to capture frequent and less common conditions. For every condition, the LLM selected true, false or uncertain based on the report wording.
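The study's actual template fields and prompt wording are not reproduced here, but a minimal sketch of the extraction step, assuming the official openai Python client and an illustrative clavicle template, might look like this:

```python
import json
from openai import OpenAI  # assumes the official openai Python client

client = OpenAI()

# Hypothetical label set; the study's templates were designed by a senior
# musculoskeletal radiologist and covered frequent and rarer conditions.
TEMPLATE = {
    "fracture": "true | false | uncertain",
    "displacement": "true | false | uncertain",
    "degenerative_changes": "true | false | uncertain",
}

def extract_labels(report_text: str) -> dict:
    """Zero-shot completion of a JSON label template from one report."""
    prompt = (
        "Complete this JSON template using only the radiology report below. "
        "Answer each finding with true, false or uncertain.\n"
        f"Template: {json.dumps(TEMPLATE)}\nReport: {report_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # constrain output to valid JSON
        temperature=0,  # deterministic-leaning decoding for consistent labels
    )
    return json.loads(response.choices[0].message.content)
```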
Manual verification was performed for all internal and external test sets. Internal test cohorts comprised 233 clavicle reports, 745 elbow reports and 393 thumb reports, while external cohorts comprised 300 reports per region. In the test sets, all labels were finalised as true or false, with follow-up imaging used where needed. Across the combined test sets, automatic extraction was correct in 98.6% of labels (60,618 of 61,488). Region-specific label-level accuracies for the external cohorts reached 98.6% for clavicle, 98.4% for elbow and 98.1% for thumb, with report-level accuracies between 71.3% and 73.7%. Internal cohorts showed similar label-level accuracies of 98.6% to 99.0%, with report-level accuracies between 74.4% and 85.5%.
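The gap between label-level and report-level figures is expected if, as the numbers suggest, a report counts as correct only when every one of its labels is correct. With roughly 27 labels per test report (61,488 labels across 2,271 reports), a per-label accuracy of 98.6% compounds to about 0.986^27 ≈ 0.68 under independent errors, close to the observed report-level range. A small sketch of both metrics, assuming nested lists of predicted and reference labels:

```python
def label_level_accuracy(pred, ref):
    """Fraction of individual labels matching the reference."""
    total = sum(len(labels) for labels in ref)
    correct = sum(p == r
                  for p_report, r_report in zip(pred, ref)
                  for p, r in zip(p_report, r_report))
    return correct / total

def report_level_accuracy(pred, ref):
    """Fraction of reports in which every label matches the reference."""
    return sum(p_report == r_report
               for p_report, r_report in zip(pred, ref)) / len(ref)
```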
Label uncertainty was present but relatively infrequent. In internal cohorts, manual review identified uncertain wording in 3.9% of clavicle reports, 10.5% of elbow reports and 9.7% of thumb reports, while the extraction pipeline automatically flagged 0.9%, 6.4% and 5.3%, respectively. External cohorts showed a similar pattern, with uncertain wording identified in 5.3% of clavicle, 16.3% of elbow and 16.0% of thumb reports, while 3.3%, 8.7% and 13.3% were flagged automatically.
Handling Uncertainty with Inclusive and Exclusive Strategies
To test the operational impact of uncertainty during model development, two training strategies were compared while keeping all other parameters constant. In the inclusive approach, labels originally marked as uncertain in training and validation were reassigned to true. In the exclusive approach, uncertain labels were reassigned to false. All test sets contained only definitive labels to allow unbiased evaluation. The number of uncertain labels in training and validation was limited (42 for clavicle, 492 for elbow and 231 for thumb).
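In code terms, the two strategies reduce to how the uncertain class is mapped before training; a minimal sketch, assuming string-valued labels from the extraction step:

```python
def resolve_uncertain(labels, strategy="inclusive"):
    """Map extracted labels to binary training targets.

    inclusive: uncertain -> positive (1); exclusive: uncertain -> negative (0).
    Test labels are left untouched, as they contain only definitive values.
    """
    uncertain_target = 1 if strategy == "inclusive" else 0
    mapping = {"true": 1, "false": 0, "uncertain": uncertain_target}
    return [mapping[label] for label in labels]
```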
Models were implemented in PyTorch using a modified ResNet50 backbone configured for multi-label output with sigmoid activation. For elbow and thumb, anteroposterior and lateral projections were processed by separate networks with feature concatenation before classification. Standard augmentation and resizing to 512×512 pixels were applied. Operating thresholds applied to the test sets were selected with the Youden index derived from validation performance.
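The full implementation is not described beyond these points, but a minimal PyTorch sketch of the dual-view design and the threshold selection, with the label count, pretrained weights and helper names as assumptions, could look like this:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import roc_curve
from torchvision import models

class DualViewResNet(nn.Module):
    """Two ResNet50 branches (AP and lateral views) whose features are
    concatenated before a shared multi-label sigmoid head."""

    def __init__(self, num_labels: int):
        super().__init__()
        self.ap_branch = models.resnet50(weights="IMAGENET1K_V2")
        self.lat_branch = models.resnet50(weights="IMAGENET1K_V2")
        feat_dim = self.ap_branch.fc.in_features  # 2048 per branch
        self.ap_branch.fc = nn.Identity()   # strip the ImageNet heads
        self.lat_branch.fc = nn.Identity()
        self.classifier = nn.Linear(2 * feat_dim, num_labels)

    def forward(self, ap: torch.Tensor, lat: torch.Tensor) -> torch.Tensor:
        features = torch.cat([self.ap_branch(ap), self.lat_branch(lat)], dim=1)
        return torch.sigmoid(self.classifier(features))  # per-label probabilities

def youden_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Validation threshold maximising Youden's J = sensitivity + specificity - 1,
    which equals TPR - FPR along the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return float(thresholds[np.argmax(tpr - fpr)])
```

In training one would typically drop the explicit sigmoid and use BCEWithLogitsLoss for numerical stability; it is shown here to match the published description.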
Comparisons used macro-averaged receiver operating characteristic area under the curve (AUC), calculated across labels with at least 10 positive cases per test set. Statistical testing employed the DeLong method with Benjamini–Hochberg correction. Across regions and datasets, no significant AUC differences were observed between inclusive and exclusive training or between internal and external testing (p ≥ 0.15). This indicated that the chosen treatment of uncertain labels during training did not materially alter downstream diagnostic performance in this setting.
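The filtering rule maps directly onto evaluation code; a short sketch using scikit-learn, with (n_samples, n_labels) arrays assumed and the DeLong/Benjamini–Hochberg testing omitted:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def macro_auc(y_true: np.ndarray, y_score: np.ndarray, min_pos: int = 10) -> float:
    """Macro-average per-label ROC AUC over labels with >= min_pos positives."""
    keep = y_true.sum(axis=0) >= min_pos  # skip labels too rare to evaluate
    per_label = [roc_auc_score(y_true[:, j], y_score[:, j])
                 for j in np.where(keep)[0]]
    return float(np.mean(per_label))
```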
Performance and Generalisation Across Clavicle, Elbow and Thumb
Performance varied by region and label prevalence but remained competitive overall. For clavicle, macro-averaged AUC reached 0.80 (range 0.59–0.95) for inclusive training and 0.81 (0.63–0.94) for exclusive training. For elbow, macro-averaged AUC was 0.80 (0.62–0.87) for inclusive and 0.80 (0.61–0.88) for exclusive. For thumb, macro-averaged AUC was 0.76 (0.59–0.91) for inclusive and 0.78 (0.61–0.90) for exclusive. External generalisation was maintained, exemplified by elbow models reaching 0.79 macro-averaged AUC for both strategies.
Label-level behaviour reflected class balance and visual conspicuity. Fracture-related labels and displacement exhibited high AUCs where positive case counts were larger. Rarer or subtler findings, particularly soft-tissue abnormalities and small ossicles, showed lower or less consistent AUCs and wider confidence intervals. Threshold-dependent metrics like sensitivity and specificity varied by label, with some trade-offs evident at the single Youden-selected operating point, yet threshold-independent AUC comparisons remained stable across strategies and sites.
The uncertainty detection gap between manual review and automated extraction did not translate into measurable AUC differences. Given the low absolute proportion of uncertain labels in training and validation, the effect size of reassigning them as positive versus negative appears limited under the tested conditions. Together with the high label-level extraction accuracy, these results support the feasibility of zero-shot LLM labelling for assembling multi-label training datasets from routine reports in upper extremity radiography.
Zero-shot LLM extraction produced high-accuracy structured labels from routine radiology reports and enabled competitive multi-label CNNs for clavicle, elbow and thumb radiography. Performance generalised to an external site, and alternative strategies for handling uncertain wording during training did not significantly affect results. Stronger performance for common fracture-related findings and weaker performance for rarer soft-tissue labels underline the continued importance of class balance and case availability. The workflow provides a scalable way to create training labels from routine reports, shortening data preparation cycles for imaging AI and retaining clinically relevant performance.
Source: European Radiology
Image Credit: iStock