Growing imaging demand and the complexity of interventional stroke documentation place sustained pressure on radiology workflows. Mechanical thrombectomy reports contain technical details, angiographic findings and outcome classifications that clinicians need quickly and accurately. An evaluation of seven open-source large language models assessed their ability to generate concise, clinically faithful summaries of thrombectomy reports and to infer angiography-based outcomes from text alone. The analysis covered quantitative text-similarity metrics and a manual review by radiologists focusing on correctness and completeness across outcome scores, vessels, laterality, pass counts, additional information, hallucinations and grammar. Results suggest broadly similar performance across models, moderate accuracy for outcome prediction and error patterns that matter for safe clinical use.
Dataset, Training and Evaluation Design
The evaluation drew on 2000 German-language neurointerventional reports for acute ischaemic stroke treated with mechanical thrombectomy, collected between September 2013 and August 2024. Reports followed a standard structure with a detailed findings section and a summarising impression. For model development, 1900 reports were used to fine-tune each LLM with Low-Rank Adaptation, and 100 reports formed the held-out test set. During testing, models received only the findings sections and were prompted to generate an impression that emphasised the Thrombolysis In Cerebral Ischaemia (TICI) scale, including the expanded 2c grade.
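The described setup (4-bit quantisation plus Low-Rank Adaptation) could be configured along these lines with the Hugging Face `transformers` and `peft` libraries. The rank, alpha, dropout and target modules below are illustrative assumptions; the article does not report the shared training configuration's hyperparameters.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantisation settings (NF4 is a common default; assumed, not from the study).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-Rank Adaptation config; rank/alpha/target modules are placeholder values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```

In practice, such configs would be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model` before training on the 1900 findings–impression pairs.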
Seven open-source LLMs were fine-tuned: Meta-Llama-3.1-8B, mistral-7b-instruct, gemma-2-9b, Llama-3.2-1B, Llama-3.2-3B, OpenBioLLM-8B and BioMistral-7b. All used 4-bit quantisation and a shared training configuration. Quantitative performance against the original impressions used ROUGE-1, ROUGE-2, ROUGE-L, METEOR, BERTScore (F1) and BLEU. A manual analysis of the four best-scoring models then compared generated summaries to the original findings for correctness and completeness across key variables. Statistical testing used χ² procedures with post hoc comparisons.
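As a rough illustration of how one of these overlap metrics works, a simplified ROUGE-1 F1 (unigram overlap, lowercased whitespace tokens, no stemming) can be sketched as below; production implementations such as `rouge-score` differ in tokenisation and preprocessing details.

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared unigrams, counting multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "complete recanalisation of the left m1 segment tici 3"
generated = "tici 3 recanalisation of the left m1 segment"
print(round(rouge1_f(generated, reference), 2))  # → 0.94
```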
Headline Results and Error Patterns
Across quantitative metrics, models performed similarly. BioMistral-7b scored highest on most measures, including ROUGE-1 at 0.47, ROUGE-2 at 0.30 and ROUGE-L at 0.43, and achieved the top METEOR value of 0.46. BERTScore (F1) values were tightly clustered at 0.81–0.82, with BioMistral-7b and mistral-7b-instruct showing the best semantic alignment. BLEU values were generally low, with small differences. OpenBioLLM-8B trailed the group on all quantitative metrics. According to the bar charts on page 7, these differences were modest in magnitude, reinforcing the overall convergence after task-specific fine-tuning.
Manual evaluation of the top four models highlighted practical distinctions. Correct derivation of the TICI score from text alone ranged from 66.27% to 71.08%, with Meta-Llama-3.1-8B achieving 71.08%. Recanalised vessels were reported in 71–84% of summaries, with 59.15–64.29% of those mentions fully correct; BioMistral-7b reached the highest reporting rate at 84% and the highest correctness rate at 64.29%. Laterality was included in 54–64% of summaries and was correct in 70.37–78.69% of those mentions, with gemma-2-9b at the upper end of correctness.
Pass counts exposed sharper contrasts. Gemma-2-9b most frequently documented pass counts, mentioning them in 56% of cases, while mistral-7b-instruct reported the highest accuracy among the cases where passes were mentioned, with 29 of 38 correct (76.32%). Both differences were statistically significant in their respective comparisons. Relevant additional information, including complications, was unevenly captured and relatively infrequent, appearing in 24.53–37.74% of summaries, with complete correctness varying across models.
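For χ² comparisons like these, the Pearson test statistic for a 2×2 contingency table can be computed directly. The counts below (mention versus no mention for two models, with assumed denominators of 100 summaries each) are illustrative, not the study's actual tables.

```python
def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].
    All row and column totals must be nonzero."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Illustrative: model A mentions passes in 56/100 summaries, model B in 38/100.
stat = chi2_2x2(56, 44, 38, 62)
print(round(stat, 2))  # ≈ 6.5, above the 3.84 critical value at df=1, alpha=0.05
```

A statistic above the critical value would indicate a significant difference in mention rates; in practice `scipy.stats.chi2_contingency` also returns the p-value and handles continuity correction.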
Hallucinations occurred at moderate and similar rates across models, affecting 23–25% of summaries. The most common error types were incorrect side attribution for vessels, accounting for 44.74% of hallucinations, and invented or misassigned recanalised vessels and segments at 30.70%. Less frequent but noteworthy issues included erroneous pass counts, misstated groin-to-perfusion timing and mischaracterised occlusion extent. Grammar errors were uncommon overall at 12–21%, though long and complex source texts could lead to truncated outputs in some cases.
Implications for Workflow Integration and Limitations
Findings indicate that fine-tuned open-source LLMs can generate concise, clinically aligned impressions from detailed procedural findings, with moderate success at inferring TICI grades from text alone. The clustered quantitative scores suggest that domain-specific fine-tuning narrows performance gaps among architectures of similar scale, while manual review reveals specific strengths and weaknesses that bear on deployment choices. For example, pass-count reporting and accuracy differ between models, and additional information is inconsistently prioritised, which matters for downstream decision-making.
Error profiles are particularly relevant for safe integration. Laterality errors and vessel misassignments dominated hallucinations, reflecting how often these attributes appear in thrombectomy narratives and the variety of segment terminology. Given these patterns, safeguards such as confidence scoring, constrained vocabularies aligned to validated anatomical terms, retrieval-augmented prompts that anchor vessel and laterality details to the source text, and expert validation before sign-off are likely to improve reliability. In routine settings, automatic draft impressions and structured data fields for vessel, side, TICI and pass count could streamline reporting and support registries, quality assurance and research, provided a radiologist verifies outputs to prevent propagation of incorrect parameters.
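One of the suggested safeguards, constrained vocabularies plus structured fields for vessel, side, TICI grade and pass count, could be sketched as a post-processing validator over a draft impression. The field names, regular expressions and the abbreviated vessel list here are illustrative assumptions, not the study's implementation; `None` flags a missing or out-of-vocabulary value for human review.

```python
import re

# Constrained vocabularies (illustrative, not an exhaustive anatomical list).
TICI_GRADES = {"0", "1", "2a", "2b", "2c", "3"}
VESSELS = r"\b(ICA|M1|M2|BA|P1)\b"

def extract_fields(impression: str) -> dict:
    """Pull structured fields from a draft impression; None means
    not found or outside the allowed vocabulary."""
    out = {"tici": None, "side": None, "vessel": None, "passes": None}
    m = re.search(r"\bTICI\s*(\d[abc]?)\b", impression, re.IGNORECASE)
    if m and m.group(1).lower() in TICI_GRADES:
        out["tici"] = m.group(1).lower()
    m = re.search(r"\b(left|right)\b", impression, re.IGNORECASE)
    if m:
        out["side"] = m.group(1).lower()
    m = re.search(VESSELS, impression)
    if m:
        out["vessel"] = m.group(1)
    m = re.search(r"\b(\d+)\s+pass(?:es)?\b", impression, re.IGNORECASE)
    if m:
        out["passes"] = int(m.group(1))
    return out

print(extract_fields("TICI 2b recanalisation of the left M1 after 2 passes"))
```

Fields that fail validation could be routed to the radiologist for correction before sign-off, rather than silently propagating into registries.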
Several constraints temper generalisability. The evaluation reflects single-centre reporting styles and a modest test set of 100 cases. The models were 4-bit quantised, which improves efficiency but may limit context fidelity for complex texts. Original impressions used for training and metric references varied in completeness. The prompt emphasised TICI definitions but did not include exemplar summaries, which might have further standardised outputs.
Task-specific fine-tuning enabled several open-source LLMs to produce useful summaries of mechanical thrombectomy reports and to infer TICI grades from textual findings with moderate accuracy. Quantitative performance converged across models, while manual review exposed targeted strengths and recurrent error modes, especially around laterality and vessel identification. With appropriate safeguards, human oversight and structured extraction, these tools could support faster, more consistent reporting in acute stroke pathways, while further refinement should focus on contextual understanding, handling of ambiguous inputs and reduction of hallucinations.
Source: European Radiology