Vision-language models (VLMs) are being explored for clinically oriented image interpretation across multiple modalities, yet their behaviour degrades when images contain artefacts that are common in practice. On clean datasets spanning brain MRI, retinal OCT (optical coherence tomography) and chest X-ray, overall accuracy was only moderate, hovering around 0.60 across tasks. Introducing weak artefacts reduced performance further by single-digit to low double-digit percentages depending on modality, while severe distortions were rarely flagged as ungradable, with detection rates typically near 0.10–0.20. The pattern points to consistent risks: artefacts can create spurious cues that resemble disease or conceal genuine pathology, leading to false positives or false negatives. Findings suggest that artefact-aware evaluation and explicit quality checks are required before such systems can be relied upon in demanding clinical workflows.

 

Benchmark Across MRI, OCT and X-Ray

The evaluation covered three representative tasks, each with 200 images per modality balanced between normal and diseased cases, and applied five artefact types at two severities. Intensity artefacts—such as bias field variation, motion and noise—were assessed alongside spatial artefacts like cropping and rotation. Weak artefacts partially degraded images but kept them interpretable, while strong artefacts produced ungradable images for which safe classification should be deferred. Performance was tracked through three measures: standard disease-detection metrics on clean images, the percentage change under weak artefacts, and a strong-artefact detection rate reflecting whether models flagged poor quality rather than attempting diagnosis.
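
Although the study's code is not reproduced here, the three headline measures are simple to express. The sketch below is a minimal illustration in Python, assuming predictions are plain strings; the function names and the "ungradable" label convention are ours for illustration, not the paper's.

def accuracy(preds, labels):
    # Fraction of predictions that match the ground-truth labels.
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def relative_change(clean_acc, artefact_acc):
    # Percentage change in accuracy under weak artefacts, relative to clean images.
    return 100.0 * (artefact_acc - clean_acc) / clean_acc

def strong_artefact_detection_rate(preds):
    # Share of strongly corrupted images that the model flags as "ungradable"
    # instead of attempting a diagnosis.
    return sum(p == "ungradable" for p in preds) / len(preds)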

 

Error patterns echoed typical clinical challenges. Motion on OCT tended to flip correct normal classifications into false positives, while added noise converted correctly detected abnormalities into false negatives. Spatial artefacts affected tasks unevenly. Rotation could scramble expected anatomy and depress accuracy, whereas certain cropping patterns sometimes concentrated attention on diagnostically relevant regions. Across modalities, these perturbations revealed brittle behaviour that emerged even at weak artefact levels, exposing how fragile learned cues can be when acquisition conditions vary from ideal settings.

 

Impact of Weak Artefacts on Detection

On original images, headline performance varied by modality and model, but none delivered consistently high accuracy across all tasks. Baseline results around 0.60 made subsequent declines under weak artefacts more consequential for clinical usability. When noise was added to chest X-ray images, for example, one of the stronger approaches saw accuracy collapse to 0.466, a drop approaching 40% from its clean-image performance. In contrast, other intensity artefacts in the same setting produced smaller but still meaningful losses. OCT proved sensitive to motion, which shifted some previously correct outputs into errors. MRI outcomes were mixed: weak noise and motion tended to hurt, yet certain spatial tweaks occasionally yielded improvements.
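
For concreteness, the relative drop behind that figure can be reconstructed; the clean-image accuracy of about 0.77 used below is inferred from the reported numbers for illustration, not stated directly in the summary.

clean_acc, noisy_acc = 0.77, 0.466                    # 0.77 is an inferred, illustrative value
drop = 100.0 * (clean_acc - noisy_acc) / clean_acc    # ≈ 39.5%, i.e. a drop approaching 40%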

 

Not all artefacts were uniformly harmful. In brain MRI, weak random cropping sometimes improved tumour detection, lifting accuracy for some configurations by more than ten percentage points compared with unaltered images. A mild bias field could also nudge performance upward in specific setups. These effects did not generalise across modalities and were not consistent across models, but they underscore that artefacts can sometimes highlight central tissues or lesions in ways that incidentally aid discrimination. The broader trend remained negative, however, with weak perturbations frequently pushing moderate clean-image performance into ranges unlikely to be acceptable for frontline decision support.

 


 

Recognising Ungradable Images and the Role of Prompts

Safely identifying ungradable images is crucial, yet strong-artefact detection rates were generally low, often near 0.10–0.20 across modalities. There were isolated bright spots: for brain MRI with strong motion, one configuration reached a detection rate above 0.80, while others remained substantially lower on the same task. In OCT, certain intensity artefacts proved easier to spot than spatial distortions, with rotation among the hardest to recognise as disqualifying. This unevenness indicates that many systems still attempt diagnosis instead of deferring when image quality falls outside a safe operating envelope.

 

Prompting strategy influenced behaviour. Switching from structured outputs to more standard prompts sometimes improved the ability to flag poor quality. On MRI with strong noise, one model’s detection rose from near zero to well above 0.60 after a prompt change. The gains did not come without trade-offs. In chest X-ray with weak motion, a chain-of-thought configuration reduced sensitivity from about 0.81 to roughly 0.33, largely because refusals or ambiguous responses were counted as incorrect. These dynamics show that robustness is not just a property of model weights but also of interface choices that can either surface caution or suppress core diagnostic signals.
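
The published prompts are not quoted in this summary, so the templates below are invented for illustration; they show the general contrast between a structured output that forces a fixed label and a standard prompt that leaves room to flag quality first.

# Hypothetical prompt templates; neither string is taken from the study.
STRUCTURED_PROMPT = (
    "Classify this image. Respond with exactly one word: "
    "'normal' or 'abnormal'."
)

STANDARD_PROMPT = (
    "Assess this image. If quality is too poor to grade, reply "
    "'ungradable' and briefly say why; otherwise state whether it is "
    "normal or abnormal."
)

A template that admits only two labels leaves no channel for deferral, which is one plausible reason the standard phrasing surfaced more quality flags.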

 

A complementary real-world experiment with colour fundus photographs reinforced the main findings. On high-quality images, the leading configuration achieved accuracy in the low 0.70s with high specificity, whereas a weaker baseline lingered below 0.50. Introducing weak artefacts reduced accuracy across models, with the largest drops exceeding 40% for some. For clearly ungradable images, one approach achieved a poor-quality detection rate of around 0.78, yet this did not translate into gains where images remained interpretable but degraded.

 

Across MRI, OCT and X-ray, medical VLMs showed moderate accuracy on clean images and deteriorated further with weak artefacts, while strong-artefact detection was generally low. Occasional improvements from spatial tweaks and prompt-driven gains in quality-flagging were offset by modality-specific fragility and sensitivity trade-offs. Embedding artefact-aware benchmarks, routine quality checks and careful prompt design into development and evaluation can help reduce misclassification risk and support safer deployment in real clinical workflows.
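
The paper recommends explicit quality checks without prescribing an implementation; one minimal pattern, sketched here with invented function names (grade_quality, diagnose), is a two-stage gate that defers whenever an image is judged ungradable.

def classify_with_quality_gate(image, grade_quality, diagnose):
    # Run an explicit quality check before any diagnostic call; defer on failure.
    if grade_quality(image) == "ungradable":
        return {"decision": "defer", "reason": "image quality outside the safe operating envelope"}
    return {"decision": diagnose(image)}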

 

Source: npj digital medicine

Image Credit: iStock


References:

Cheng Z, Ong AY, Wagner SK et al. (2025) Understanding the robustness of vision-language models to medical image artefacts. npj Digit Med 8:727.


