Radiology and pathology reports are central to clinical decision-making and research, yet their unstructured format often hinders automated data analysis. While previous natural language processing techniques have demonstrated some potential in extracting information, they often suffer from limited generalisability and high computational costs, and raise privacy concerns when they rely on commercial solutions.
The emergence of open-weight transformer-based language models offers a promising alternative, especially when combined with retrieval-augmented generation (RAG), a technique that enhances understanding by retrieving relevant text fragments during processing. A recent study assessed an automated pipeline employing state-of-the-art open-weight language models and RAG to extract structured data from clinical reports. The study evaluated various model configurations to determine how model type, size, quantisation, prompting strategy and retrieval approach affect performance when applied to real-world clinical data from radiology and pathology.
Performance of Models and Impact of Prompting Strategies
The automated pipeline was applied to two datasets: one comprising 7294 radiology reports annotated with Brain Tumor Reporting and Data System (BT-RADS) scores, and the other containing 2154 pathology reports annotated for IDH mutation status. Thirteen open-weight models, including Llama 2, Llama 3, MedLlama, Mistral, Mixtral and Phi3, were tested across 407 configurations. The best-performing models, notably OpenBioLLM-Llama-3 and Llama 3.1, achieved up to 98.7% accuracy in extracting BT-RADS scores and over 90% for identifying IDH mutation status. Larger and more recent models consistently outperformed older or smaller ones. Interestingly, the smaller Phi3 model performed nearly on par with the larger models in several scenarios, suggesting that architecture and training optimisation can offset limitations in size, making it suitable for environments with limited computing resources.
Prompt engineering played a pivotal role in performance. Complex prompts that included role descriptions and clear instructions for valid outputs yielded higher accuracy than simple prompts. Few-shot prompting, which provided the model with examples of correct and incorrect extractions, significantly enhanced results: models using this method showed average accuracy improvements of more than 30%. Moreover, including examples where the target data was not reported improved accuracy further. The benefits of prompt refinement highlight the importance of carefully designing instructions for clinical data extraction tasks.
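To make the strategy concrete, the sketch below assembles such a few-shot prompt in Python. The role description, the example reports and their labels (including one where the score is absent) are invented for illustration and are not the study's actual prompts.

```python
# A minimal sketch of the few-shot prompting strategy described above.
# The role description, example reports and labels are invented
# placeholders, not the study's actual prompts.

SYSTEM_ROLE = (
    "You are a radiology assistant. Extract the BT-RADS score from the "
    "report. Answer with the score only, or 'not reported' if absent."
)

# Few-shot examples, including one where the target is not reported,
# which the study found to further improve accuracy.
FEW_SHOT_EXAMPLES = [
    ("Follow-up MRI shows stable post-treatment changes. BT-RADS 2.", "2"),
    ("New enhancing lesion concerning for progression. BT-RADS 4.", "4"),
    ("Routine follow-up study; no BT-RADS score was assigned.", "not reported"),
]

def build_prompt(report_text: str) -> str:
    """Assemble a complex prompt: role, instructions, examples, then the report."""
    parts = [SYSTEM_ROLE, "", "Examples:"]
    for example_report, label in FEW_SHOT_EXAMPLES:
        parts.append(f"Report: {example_report}")
        parts.append(f"Answer: {label}")
    parts += ["", f"Report: {report_text}", "Answer:"]
    return "\n".join(parts)

print(build_prompt("Slight decrease in enhancement. BT-RADS 1b."))
```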
The enforcement of JSON output formatting had a modest positive effect on accuracy in radiology tasks but showed limited impact on pathology reports. While JSON formatting helps standardise output and reduce variability, its benefits appear dependent on report complexity and the task being performed. This finding suggests that while output format standardisation is useful, it may not universally improve performance across all report types.
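As a rough illustration of how JSON enforcement can be paired with validation, the sketch below instructs the model to return a fixed JSON object and rejects any reply that does not parse. The schema and the None fallback are assumptions for illustration, not the study's implementation.

```python
# A hedged sketch of JSON output enforcement: the model is told to reply
# with a fixed JSON object, and the reply is validated before use.
import json

JSON_INSTRUCTION = (
    'Respond only with JSON of the form {"bt_rads": "<score>"} '
    "and no additional text."
)

def parse_model_reply(reply: str) -> str | None:
    """Return the extracted score, or None if the reply is not valid JSON."""
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return None  # malformed output: flag for retry or manual review
    value = payload.get("bt_rads")
    return value if isinstance(value, str) else None

print(parse_model_reply('{"bt_rads": "3a"}'))  # -> 3a
print(parse_model_reply("The score is 3a."))   # -> None
```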
Retrieval and Sampling Parameters: When Complexity Matters
Retrieval-augmented generation was another focal point of the study. The approach was found to be highly beneficial for pathology reports, which are longer and more complex than radiology reports. The use of RAG improved performance in pathology tasks by an average of 48%. In contrast, applying RAG to the relatively short radiology reports resulted in a slight decrease in performance. This contrast indicates that RAG is more effective when the report structure is dense and information is dispersed across the text, allowing the retrieval component to isolate the most relevant context.
The pipeline’s RAG process used sentence-based text chunking, embedding with gte-large, and retrieval via cosine similarity in a vector database. A reranker then selected the most relevant text chunk using predefined keywords. These included phrases related to follow-up scores in radiology and mutation markers in pathology. If a minimum relevance score was not met, RAG was disabled. This ensured that the retrieval process only contributed when it could enhance model understanding.
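A simplified sketch of this retrieval step is shown below, using the sentence-transformers library to load gte-large. The keyword list, the query and the relevance threshold are illustrative assumptions rather than the study's actual values.

```python
# A simplified sketch of the retrieval step: sentence-level chunking,
# gte-large embeddings, cosine-similarity ranking, a keyword-based rerank
# and a relevance threshold that disables RAG when no chunk qualifies.
# The keyword list and threshold value are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("thenlper/gte-large")
KEYWORDS = ("idh", "mutation", "mutant", "wildtype")  # hypothetical rerank terms
MIN_RELEVANCE = 0.5  # assumed cut-off below which RAG is skipped

def retrieve_context(report: str, query: str) -> str | None:
    # Naive sentence chunking; the study's pipeline may segment differently.
    chunks = [s.strip() for s in report.split(".") if s.strip()]
    if not chunks:
        return None
    chunk_emb = embedder.encode(chunks, convert_to_tensor=True)
    query_emb = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_emb)[0]

    # Rerank: prefer chunks containing predefined keywords, then by similarity.
    ranked = sorted(
        zip(chunks, scores.tolist()),
        key=lambda pair: (any(k in pair[0].lower() for k in KEYWORDS), pair[1]),
        reverse=True,
    )
    best_chunk, best_score = ranked[0]
    return best_chunk if best_score >= MIN_RELEVANCE else None  # None -> RAG disabled
```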
Sampling parameters, including temperature, top-k and top-p, were also tested to understand their influence on performance. These parameters typically control the randomness of model output during inference. However, the study found no meaningful impact from adjusting these variables. Similarly, quantisation, which reduces model size and memory requirements by lowering the number of bits used for computations, had minimal effect on accuracy. These findings suggest that once a suitable model is selected and properly prompted, finer adjustments to sampling or quantisation may be less critical.
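For readers running open-weight models locally, the sketch below shows where these sampling parameters and quantisation settings typically enter a Hugging Face transformers workflow. The model name, prompt and parameter values are illustrative, not the study's configuration.

```python
# A minimal sketch of where sampling parameters and quantisation enter a
# local Hugging Face transformers workflow. The model name, prompt and
# parameter values are illustrative, not the study's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example open-weight model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit quantisation
    device_map="auto",
)

inputs = tokenizer("Report: ... What is the IDH status?", return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,
    temperature=0.7,  # randomness of token sampling
    top_k=50,         # restrict sampling to the 50 most likely tokens
    top_p=0.9,        # nucleus-sampling probability mass cut-off
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```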
Applications, Limitations and Future Considerations
The study demonstrated that open-weight language models can achieve high accuracy in automated extraction of structured clinical information while preserving data privacy through local deployment. This has meaningful implications for health institutions seeking to develop in-house solutions for clinical research, reporting or database curation. The semi-automated optimisation approach adopted in the study, beginning with model evaluation on sampled reports and progressing to full dataset application, offers a practical strategy for resource-efficient model selection and deployment.
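A minimal sketch of that strategy follows; the extract() stub is a hypothetical stand-in for a full local-model inference call, not the study's code.

```python
# A hedged sketch of the semi-automated optimisation strategy: score each
# candidate configuration on a small annotated sample, then apply only the
# winner to the full dataset. extract() is a hypothetical stub standing in
# for a full local-model inference call.
def extract(report: str, config: dict) -> str:
    """Placeholder for running the configured model on one report."""
    return "2"  # stub answer; a real call would query the local model

def select_best_config(configs: list[dict], annotated_sample: list[tuple[str, str]]) -> dict:
    def accuracy(config: dict) -> float:
        hits = sum(extract(text, config) == label for text, label in annotated_sample)
        return hits / len(annotated_sample)
    return max(configs, key=accuracy)

sample = [("Stable findings. BT-RADS 2.", "2"), ("Progression. BT-RADS 4.", "4")]
configs = [{"model": "llama3", "few_shot": True}, {"model": "phi3", "few_shot": False}]
print(select_best_config(configs, sample))  # the winner is then run on all reports
```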
However, limitations remain. The reports originated from a single institution, which may limit generalisability to other settings with different report structures or terminologies. The variables extracted in this study were discrete and clearly defined, making them more amenable to automation. Tasks involving more ambiguous or subjective information may present additional challenges. Furthermore, while the configuration space explored was extensive, it was not exhaustive. Many parameters interact in complex ways, and future studies could investigate combinations not included in the current work.
Another important consideration is the evolving ecosystem of structured output formats. Although JSON was the primary focus in this study, alternatives such as XML or YAML might offer different advantages depending on the target use case. The growing capability of commercial language models to support structured outputs might also influence future adoption patterns, though open models remain more flexible in terms of customisation and local control.
The study highlighted the strong potential of open-weight language models combined with retrieval-augmented generation for extracting structured data from unstructured clinical reports. Larger, newer and medically fine-tuned models offered the best results, while quantisation and sampling parameters had minimal impact. Prompt design was crucial, with few-shot and detailed prompts delivering significant improvements. Retrieval-augmented generation was especially beneficial for complex pathology reports.
These findings underscore the promise of open AI systems for clinical data structuring tasks, particularly when combined with domain expertise and semi-automated optimisation strategies. With further research and refinement, such tools may become integral to clinical research, quality assurance and reporting pipelines.
Source: Radiology: Artificial Intelligence
Image Credit: iStock