Vision-Language Models (VLMs) combine image recognition with text understanding and are increasingly promoted for clinical workflows. In neuroradiology, that promise depends on whether a model can interpret brain and spine images accurately, handle brief clinical context and avoid unsafe recommendations. An evaluation using 100 neuroradiology cases compared several widely used VLMs with experienced neuroradiologists, focusing on diagnostic accuracy, the safety of the most likely diagnosis and the main types of errors. The results show a clear gap between model outputs and specialist performance, with a non-trivial share of model answers judged capable of causing clinical harm, especially through missed time-sensitive conditions and incorrect classification.
Accuracy Remains Well Below Specialist Reading
Five VLMs were assessed: three commercial systems (Gemini 2.0, OpenAI o1 and Grok-2-vision) and two open-source models (Llama 3.2 90B and Qwen 2.5). Each model was asked to interpret cases from a short clinical presentation and selected images, providing a most probable diagnosis plus key differentials. Neuroradiologists, reviewing the same cases, achieved a mean accuracy of 86.2% for the most probable diagnosis.
All VLMs performed substantially below that benchmark. The best-performing model, Gemini 2.0, reached 35% accuracy for the most probable diagnosis, while the other models scored lower, with the weakest far below the rest. Looking beyond a single answer helped, but not enough: when an answer was scored as correct if the true diagnosis appeared anywhere in the top three differentials, the strongest model improved to 52%, still well short of specialist performance.
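As a minimal sketch of this two-tier scoring scheme (hypothetical cases and diagnoses, not the study's data or code), top-1 and top-3 accuracy can be computed as follows:

```python
# Hypothetical examples: (ground-truth diagnosis, model's ranked differentials)
cases = [
    ("glioblastoma", ["glioblastoma", "metastasis", "abscess"]),
    ("cavernoma", ["metastasis", "cavernoma", "arteriovenous malformation"]),
    ("meningioma", ["schwannoma", "metastasis", "lymphoma"]),
]

# Top-1: the most probable diagnosis must match exactly.
top1_hits = sum(truth == diffs[0] for truth, diffs in cases)
# Top-3: credit if the true diagnosis appears anywhere in the differentials.
top3_hits = sum(truth in diffs[:3] for truth, diffs in cases)

n = len(cases)
print(f"Top-1 accuracy: {top1_hits / n:.0%}")
print(f"Top-3 accuracy: {top3_hits / n:.0%}")
```

The gap between the two numbers indicates how often the correct diagnosis was considered but not ranked first, which is why the relaxed metric improves model scores without closing the gap to specialists.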
Performance was uneven across diagnostic categories. Congenital and developmental disorders stood out as particularly difficult for all models, with very low accuracy reported in that category. The cases covered a wide mix of neuroradiology conditions, including tumours, vascular disease, infections, trauma and metabolic or toxic disorders. Most cases used MRI, reflecting typical neuroradiology imaging practice.
Harmful Outputs Cluster Around Missed Urgency and Misclassification
Accuracy alone does not capture clinical risk, so the evaluation also classified whether the model’s most probable diagnosis could plausibly cause harm. Harm was defined using three categories: treatment delay for time-sensitive conditions, misclassification between benign and malignant disease and overdiagnosis that could lead to unnecessary invasive procedures.
Across the VLMs, harmful outputs were common, with overall harm rates ranging from 28% for the least harmful model to 45% for the most harmful. Neuroradiologists had a lower combined harm rate of 15%. Among model errors, treatment delay was the dominant risk, with rates spanning the mid-teens up to 28% depending on the model. Misclassification was also frequent and, in some models, reached around one-fifth of cases. Overdiagnosis was rare in comparison, appearing only sporadically.
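To illustrate how such a safety analysis can be tabulated (hypothetical per-case labels, not the study's data), each output's judged harm category can be counted per model:

```python
from collections import Counter

# Hypothetical safety judgements for one model's 8 outputs. Categories
# follow the evaluation's harm definition; "none" marks outputs judged safe.
judgements = [
    "treatment_delay", "none", "misclassification", "none",
    "treatment_delay", "overdiagnosis", "none", "misclassification",
]

counts = Counter(judgements)
n = len(judgements)

for category in ("treatment_delay", "misclassification", "overdiagnosis"):
    print(f"{category}: {counts[category] / n:.0%}")

# Overall harm rate: any output not judged safe.
print(f"overall harm rate: {(n - counts['none']) / n:.0%}")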
A notable pattern was that higher diagnostic accuracy did not necessarily translate into lower risk. One model ranked near the top for accuracy while still producing harmful outputs in more than a third of cases. This mismatch matters for clinical deployment because a system that occasionally “gets it right” can still be unsafe if its confident mistakes systematically steer care towards delayed treatment or inappropriate pathways.
Errors Reflect Core Weaknesses in Neuroradiology Interpretation
Error analysis identified recurring failure modes that align closely with neuroradiology’s day-to-day demands. Problems fell into five broad areas: incorrect anatomical localisation, inaccurate description of imaging findings, misidentification of imaging modality or sequence, hallucinated findings and overlooked pathologies.
Anatomical localisation was a prominent weakness, particularly for lower-performing models. Confusions included mislabelling left and right, conflating distinct brain regions and attributing pathology to the wrong lobe. In neuroradiology, where localisation can determine differential diagnosis and management, these errors can directly affect clinical decision-making.
Imaging sequence recognition was another consistent challenge. Several models tended to assume common MRI sequences by default and struggled to recognise susceptibility-weighted images. That matters because susceptibility effects can be key to interpreting certain vascular and haemorrhagic processes. Misidentifying a sequence can therefore cascade into a wrong diagnosis, even when the abnormality itself is visible.
Hallucinated findings were also observed. In these instances, models described abnormalities that were not present on the images, such as signs associated with specific pathologies even when the scan did not support them. This failure mode is particularly concerning in imaging, where downstream steps may include invasive work-up, urgent referrals or treatments based on presumed radiological evidence. Alongside hallucinations, overlooked pathologies remained common: existing abnormalities were missed or underweighted in the reasoning, leading to incorrect conclusions.
The strongest models tended to make fewer gross anatomical errors but still frequently produced inaccurate imaging descriptions and missed important findings. Lower-performing models showed heavier concentrations of anatomical mistakes and hallucinations. One model had the highest prevalence of hallucinated findings and also frequently misidentified modality or sequences, reinforcing its higher harm rate.
Why Brief Clinical Context Exposes Model Limitations
The cases included short clinical presentations, often limited to age, sex and a brief summary; the average presentation was just over nine words. This mirrored high-throughput clinical environments but left little textual scaffolding for the models to lean on. The evaluation also explored whether the number of images per case affected performance. Across all models, accuracy declined as image count rose, but the more plausible driver was case complexity: more complex cases tended to include more images, so complexity rather than image quantity itself best explained the decline.
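A minimal sketch of how the image-count relationship could be probed (hypothetical records; the study's exact statistics are not reproduced here):

```python
from statistics import correlation  # Python 3.10+, Pearson's r

# Hypothetical per-case records: number of images supplied and whether the
# model's most probable diagnosis was correct (1) or not (0).
image_counts = [1, 2, 2, 3, 4, 4, 5, 6]
correct = [1, 1, 0, 1, 0, 0, 0, 0]

# A negative coefficient is consistent with accuracy falling as image count
# rises, but it cannot separate image quantity from case complexity, which
# is why complexity remains the more plausible explanation.
print(f"r = {correlation(image_counts, correct):+.2f}")
```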
The design also differed from educational-style neuroradiology quizzes that include detailed textual descriptions of imaging findings. In settings where scans are paired with rich narrative interpretation, models may appear more capable because the text supplies much of the diagnostic reasoning. Here, with limited text and an emphasis on image-grounded interpretation, the models’ weaknesses in localisation, sequence recognition and faithful description were harder to mask.
Across 100 neuroradiology cases, VLMs showed limited diagnostic accuracy and substantial rates of potentially harmful outputs compared with experienced neuroradiologists. Recurring errors in anatomical localisation, sequence identification and description of imaging features, along with occasional hallucinated findings, constrain clinical readiness. Even when credit was given for the top three differentials, model performance remained well below human accuracy. In settings where brief clinical information is the norm and correct interpretation of imaging sequences is essential, specialist oversight remains critical. Continued progress may narrow the gap, but current VLM performance supports a cautious approach, using these systems as experimental aids rather than independent diagnostic tools in neuroradiology.
Source: npj Digital Medicine
Image Credit: iStock