Automated labelling of pulmonary embolism CT findings can support clinical research, registries and quality improvement, but manual annotation remains time-consuming and exact-match rules can fail when structured text is edited after reporting. A single-centre retrospective diagnostic accuracy analysis published in European Radiology Experimental compared radiologists, a rule-based extractor and large language models for extracting structured pulmonary embolism CT information. The strongest results came from a hybrid workflow that used rules first and sent only failed fields to a large language model.

 

Structured Reports and Shared Labelling Rules

The dataset included 2,923 CT pulmonary angiography outputs, with one output retained per patient. Each output contained a fixed set of structured fields covering general information, pulmonary embolism findings, additional findings and the overall impression. The pulmonary embolism section included thrombus burden, clot-burden score, perfusion deficits on dual-energy CT and right-to-left ventricle ratio.

 

The rule-based extractor used exact-match rules to convert predefined fields into a standardised table. Fields were classified as missing when information was absent or could not be inferred from the template. Fields were classified as invalid when information was present but did not fit the permitted schema. These invalid cases included free-text additions, mixed entries, non-permitted categories and contradictory combinations related to pulmonary embolism status.
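The three-way outcome described above (valid, missing, invalid) can be illustrated with a minimal sketch. The field names, permitted categories and the `classify_field` helper are hypothetical and are not taken from the study's template; the sketch also leaves out the check for contradictory combinations across fields.

```python
from enum import Enum
from typing import Optional, Tuple


class FieldStatus(Enum):
    VALID = "valid"
    MISSING = "missing"
    INVALID = "invalid"


# Hypothetical schema: field names and permitted categories are illustrative,
# not the study's actual template.
SCHEMA = {
    "pe_present": {"yes", "no"},
    "thrombus_burden": {"low", "moderate", "high"},
}


def classify_field(name: str, raw: Optional[str]) -> Tuple[FieldStatus, Optional[str]]:
    """Exact-match a raw field value against its permitted categories."""
    # Absent or empty text counts as missing.
    if raw is None or not raw.strip():
        return FieldStatus.MISSING, None
    value = raw.strip()
    # Only an exact match with a permitted category is accepted.
    if value in SCHEMA[name]:
        return FieldStatus.VALID, value
    # Text is present but does not fit the schema (e.g. a free-text addition).
    return FieldStatus.INVALID, None


if __name__ == "__main__":
    print(classify_field("pe_present", "yes"))
    print(classify_field("pe_present", "yes, small segmental clot"))
    print(classify_field("thrombus_burden", ""))
```

In this sketch a value such as "yes, small segmental clot" is flagged as invalid even though the information is present, which is the kind of case a later step would need to resolve.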

 

The reference standard was defined at the level of finalised text, not source images. Two attending radiologists reviewed rule-based failures and mapped the text into the predefined schema when possible. Eight radiologists also labelled the full dataset through a structured interface. All readers and automated pipelines therefore worked within the same constrained label space.

 

LLMs Outperform Rules Alone

The rule-based extractor achieved high accuracy overall, but it performed less well than the strongest large language model pipelines. Its main weaknesses appeared in fields affected by variable phrasing, post hoc edits and wording that moved away from the pulmonary embolism template. These deviations could break exact-match parsing even when the information itself was present in the finalised output.

 

The strongest standalone large language models performed substantially better. The proprietary GPT-4.1-mini and the open-weight Falcon3-10b achieved similar item-level performance, with F1 scores of 0.98, and both exceeded the rule-based extractor. Several open-weight models, including Falcon3-10b and selected Qwen3 variants, also reached or exceeded the pooled radiologist baseline.

 

The radiologist baseline showed strong consistency. Across all readers, pooled item-level performance reached an F1 score of 0.92 and accuracy was high. Inter-reader agreement and intra-reader reproducibility were also high. However, radiologist labelling required more than 32 hours for the full cohort, underlining the practical burden of manual cohort curation even when the labelling task is structured.

 

Hybrid Workflow Delivers the Best Performance

The hybrid workflow used the rule-based extractor as the first step. Valid rule-based fields passed through unchanged. Invalid fields were routed to a large language model through a structured interface. Missing fields were not routed to a model and were excluded from performance analyses. The final output combined valid rule-based labels with model-salvaged labels from rule failures.
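The routing logic described above can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the schema, `rule_extract`, `hybrid_label` and the `keyword_stub` function standing in for a model call are all hypothetical.

```python
import re
from typing import Callable, Dict, Optional, Set, Tuple

# Hypothetical schema; names and categories are illustrative only.
PERMITTED: Dict[str, Set[str]] = {
    "pe_present": {"yes", "no"},
    "thrombus_burden": {"low", "moderate", "high"},
    "perfusion_deficit": {"yes", "no"},
}

# A salvager takes (field name, raw text, permitted categories) and returns a label or None.
Salvager = Callable[[str, str, Set[str]], Optional[str]]


def rule_extract(name: str, raw: Optional[str]) -> Tuple[str, Optional[str]]:
    """Exact-match step: returns a status and, if valid, the label."""
    if raw is None or not raw.strip():
        return "missing", None
    value = raw.strip()
    if value in PERMITTED[name]:
        return "valid", value
    return "invalid", None


def hybrid_label(report: Dict[str, Optional[str]], salvage: Salvager):
    """Rules first; only invalid fields are sent to the salvager."""
    labels: Dict[str, Optional[str]] = {}
    llm_calls = 0
    for name, allowed in PERMITTED.items():
        raw = report.get(name)
        status, value = rule_extract(name, raw)
        if status == "valid":
            labels[name] = value  # passes through unchanged
        elif status == "invalid":
            llm_calls += 1
            candidate = salvage(name, raw, allowed)
            # Keep the answer only if it stays inside the label space.
            labels[name] = candidate if candidate in allowed else None
        # Missing fields are neither routed nor labelled.
    return labels, llm_calls


def keyword_stub(name: str, raw: str, allowed: Set[str]) -> Optional[str]:
    """Stand-in for a model call: picks the single permitted word found in the text."""
    tokens = set(re.findall(r"[a-z]+", raw.lower()))
    hits = sorted(allowed & tokens)
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    report = {"pe_present": "yes, segmental", "thrombus_burden": "low"}
    print(hybrid_label(report, keyword_stub))
```

Passing the salvager in as a callable keeps the rule step independent of any particular model, so a local or cloud-hosted model could be swapped in behind the same interface.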

 

This staged approach produced the strongest overall results. Median item-level F1 increased from 0.81 with rules alone to 0.99 with the hybrid workflow, while accuracy reached 99.8%. The hybrid workflow exceeded both the rule-based extractor and the pooled radiologist label-transfer baseline.

 

The approach also reduced the number of large language model calls by 85.6%, because only outputs with rule-based failures required model processing. Median cohort labelling time fell to about one hour for the hybrid configurations. Standalone large language model extraction took longer, while radiologist labelling remained the most time-intensive option. Residual hybrid errors were rare and mainly involved pulmonary embolism-related fields, including embolus localisation, thrombus burden, right-to-left ventricle ratio and perfusion deficit.
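The call reduction is a simple proportion, sketched below. The sketch assumes one model call per routed output and a standalone baseline of one call per output; the counts in the example are purely illustrative and are not figures from the study.

```python
def call_reduction(total_outputs: int, routed_outputs: int) -> float:
    """Percentage of model calls avoided versus sending every output to the model."""
    if total_outputs <= 0:
        raise ValueError("total_outputs must be positive")
    if not 0 <= routed_outputs <= total_outputs:
        raise ValueError("routed_outputs must lie between 0 and total_outputs")
    return 100.0 * (1 - routed_outputs / total_outputs)


if __name__ == "__main__":
    # Illustrative counts only: 144 of 1,000 outputs routed to the model.
    print(round(call_reduction(1000, 144), 1))
```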

 

Deployment Depends on Local Priorities

The task focused on extraction from finalised structured text rather than diagnostic reasoning or image-based truth adjudication. The key question was how accurately structured pulmonary embolism information could be transferred into a predefined schema. In that setting, rules performed well when the wording remained schema-conformant, while large language models helped recover information when text deviated from the template.

 

The structured reporting environment sits between fully standardised templates and free-text reporting. Exact-match rules are fast, transparent and easy to audit, but they are vulnerable to small deviations in finalised text. Large language models provide more flexibility, but they require model access, computational resources and governance arrangements. A rules-first approach reserves model use for the smaller group of cases where rules fail.

 

Open-weight models performed similarly to proprietary GPT-4 variants in this template-constrained task. Local inference may therefore be relevant for institutions prioritising data governance and control, provided suitable hardware and support are available. The local setup used high-end GPU infrastructure, so it does not represent routine baseline resources for every radiology department. Cloud-hosted systems may reduce local maintenance while maintaining competitive performance.

 

A hybrid rules-first large language model workflow achieved the strongest results for labelling finalised structured pulmonary embolism CT outputs. The approach retained the speed and auditability of rule-based extraction while using model-based salvage for fields that rules could not resolve. Standalone large language models outperformed rules alone, but targeted model use delivered the best balance of accuracy, processing time and efficiency. The results support structured, selective automation for pulmonary embolism CT cohort curation.

 

Source: European Radiology Experimental
