The accurate and timely communication of critical findings from radiology reports is vital to effective patient care. Traditionally, this task has relied on human effort, with radiologists manually identifying and reporting urgent observations. However, radiology reports are often unstructured, inconsistent and lengthy, posing significant challenges for manual review and quality assurance. While previous approaches have used rule-based or supervised machine learning systems to address this issue, these methods have proven limited in scalability and generalisability. To overcome these challenges, researchers have introduced a weakly supervised, instruction-tuned pipeline based on Mistral-class large language models (LLMs) that can automatically identify and extract critical findings from diverse radiology reports without the need for fully annotated datasets.
Weakly Supervised Pipeline Development
The new pipeline introduces a two-phase training approach designed to overcome the scarcity of annotated data for model fine-tuning. In Phase I, weak labels are generated using instruction-tuned Mistral-based LLMs. These models are prompted with definitions and task instructions to extract critical and incidental findings from unlabelled radiology reports. Both zero-shot and few-shot prompting methods are employed, with few-shot examples improving alignment with task objectives. Extracted text is then mapped to a curated list of critical finding terms, expanded from established medical ontologies and institutional reporting standards.
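A minimal sketch of this weak-labelling step is shown below, using an abstract llm_generate callable in place of the actual Mistral model; the prompt wording, example report and curated term list are illustrative assumptions rather than the authors' exact implementation.

```python
# Illustrative Phase I weak-label generation (prompt text and term list are assumptions).
CRITICAL_TERMS = ["pneumothorax", "pulmonary embolism", "aortic dissection", "free air"]

FEW_SHOT_PROMPT = """You are a radiology assistant. List any critical or incidental findings.

Report: Large right-sided pneumothorax with mediastinal shift.
Findings: pneumothorax

Report: {report}
Findings:"""


def weak_label(report: str, llm_generate) -> list[str]:
    """Prompt an instruction-tuned LLM, then map its output to curated finding terms."""
    raw = llm_generate(FEW_SHOT_PROMPT.format(report=report))
    # Keep only phrases that match the curated critical-finding vocabulary.
    return [term for term in CRITICAL_TERMS if term in raw.lower()]


# Stand-in model call for demonstration; in practice this would be a Mistral-7B generation.
print(weak_label("CT shows a filling defect consistent with pulmonary embolism.",
                 lambda prompt: "pulmonary embolism"))
```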
Phase II involves weakly supervised fine-tuning using the labels generated in Phase I. Reports are paired with their corresponding extracted terms to train the model in a format aligned with instruction tuning. The system uses parameter-efficient training configurations, notably low-rank adaptation (LoRA), which updates a small set of adapter weights rather than the full model. This enables the fine-tuned model to learn task-specific patterns and improve its ability to identify critical terms across report types, imaging modalities and anatomical regions. Notably, both a general-domain model (Mistral-7B) and a biomedical-domain model (BioMistral-7B) are evaluated to assess their effectiveness in this context.
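The sketch below shows what such a LoRA configuration might look like using the Hugging Face transformers and peft libraries; the checkpoint name, target modules and hyperparameters are assumptions for illustration, not the settings reported in the study.

```python
# Illustrative LoRA setup for weakly supervised fine-tuning (hyperparameters assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.2"  # or a BioMistral-7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Only small low-rank adapter matrices are trained; the base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# Each training example pairs a report with its Phase I weak labels in instruction format.
example = {
    "instruction": "Extract critical findings from the following radiology report.",
    "input": "CT abdomen shows free air under the diaphragm.",
    "output": "free air",
}
```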
Performance Evaluation Across Datasets
Model performance is assessed using internal Mayo Clinic data and the external MIMIC-III and MIMIC-IV datasets. On the small-scale internal Mayo test set, the fine-tuned Mistral model achieves a ROUGE-2 score of 48%, outperforming its pre-trained counterpart. On the external MIMIC-III dataset, it achieves a ROUGE-2 score of 59%, again demonstrating superior performance. BioMistral models, while domain-specific, generally underperform due to limited vocabulary coverage and lack of exposure to radiology-specific report structures.
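For context, ROUGE-2 measures bigram overlap between a generated extraction and a reference text. A sketch using the rouge_score package is shown below; the example strings are invented rather than drawn from the evaluated datasets.

```python
# ROUGE-2 (bigram overlap) between an extracted finding and a reference annotation.
# Example strings are invented for illustration.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
reference = "acute pulmonary embolism in the right lower lobe"
prediction = "pulmonary embolism right lower lobe"
scores = scorer.score(reference, prediction)
print(scores["rouge2"].fmeasure)  # F-measure over shared bigrams
```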
Evaluation metrics include both traditional comparisons against human annotations and novel LLM-aided evaluations using G-Eval and Prometheus scoring. These metrics consistently show that the weakly supervised models outperform pre-trained baselines. Few-shot prompting enhances precision, recall and F1-scores, with BioMistral-WFT achieving an F1-score of 0.57 on both the internal and external test sets. The pipeline is further tested on 5,000 MIMIC-IV reports, with LLM-based metrics supporting the results observed in the human-annotated tests. Despite the promising results, performance remains moderate, highlighting the complexity of parsing nuanced clinical language and the challenge of distinguishing between chronic, new and absent findings.
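A simple set-based precision, recall and F1 computation over extracted versus human-annotated finding terms might look like the sketch below; the exact matching scheme used in the study is not detailed here, so this formulation is an assumption.

```python
# Assumed set-based precision/recall/F1 over extracted vs. reference finding terms.
def prf1(predicted: set[str], reference: set[str]) -> tuple[float, float, float]:
    tp = len(predicted & reference)                      # correctly extracted terms
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(prf1({"pneumothorax", "rib fracture"}, {"pneumothorax"}))  # (0.5, 1.0, ~0.667)
```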
Challenges and Implications
While the system performs well in extracting clearly indicated findings, it struggles with reports containing subtle linguistic cues or negative findings. For example, sentences indicating the absence of critical issues may lead to false positives. Additionally, reports describing chronic findings that are no longer clinically urgent often confuse the model. These limitations stem from the inherent ambiguity of radiology language and from the fact that the initial labels are generated automatically rather than annotated by experts.
Nevertheless, the pipeline presents a significant advance over previous rule-based and supervised approaches. It is the first system capable of extracting diverse critical findings across report types and imaging modalities without requiring manual annotations. This scalability opens avenues for integration into institutional quality monitoring and compliance workflows. Moreover, it could help reduce the risk of missed findings, facilitate timely communication and support retrospective data analysis. The publicly available pipeline and model further promote research and clinical adoption.
The development of a weakly supervised, instruction-tuned LLM pipeline marks a pivotal step toward automating the extraction of critical findings from radiology reports. By leveraging few-shot prompting and domain adaptation techniques, the system achieves improved performance across diverse datasets without the need for manually annotated training data. Although limitations remain in handling ambiguous or negative cases, the approach demonstrates strong potential for enhancing clinical decision-making, streamlining communication and supporting secondary use cases such as retrospective analysis. Future iterations of the technology may include improved context sensitivity and integration with imaging data, further solidifying its role in clinical radiology workflows.
Source: npj Digital Medicine
Image Credit: iStock