Image quality assessment (IQA) is a foundational element in both clinical practice and medical image algorithm development. Yet, commonly used full reference (FR) IQA metrics such as Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) were designed for natural images and have not been adequately validated for medical applications. Despite this, these metrics continue to dominate evaluations in research involving medical image processing. This mismatch between metric design and clinical relevance leads to inconsistencies that can distort conclusions about algorithm performance and ultimately hinder the translation of research into real-world healthcare outcomes. A recent review published in the Journal of Imaging Informatics in Medicine summarises key pitfalls associated with applying FR-IQA in medical imaging and presents insights from a multi-modal, expert-driven study that advocates for a more informed, task-specific approach. 

 

The Disconnect Between Metric Design and Clinical Use 
PSNR and SSIM were designed to measure pixel-level differences and perceptual similarity in natural images, where distortions are often uniform and data are plentiful. In contrast, medical images are modality-specific, contain subtle yet clinically critical details, and are often restricted by privacy protections. This limits the availability of annotated datasets and leaves a scarcity of validated IQA tools tailored to medical contexts. 

 

Furthermore, FR-IQA requires reference images, making it usable in controlled development settings but inadequate for real-time clinical evaluation, where no reference exists. These metrics are nonetheless widely used to benchmark algorithms for denoising, reconstruction and super-resolution in CT, MRI, X-ray and OCT imaging. Their shortcomings become evident when scores contradict diagnostic utility: a blurred yet high-scoring reconstruction may obscure a tumour, while a noisier, lower-scoring image may retain crucial diagnostic detail. This mismatch between metric output and clinical utility has been confirmed across various modalities, underscoring the need for better-suited evaluation strategies. 
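The failure mode described above can be reproduced with a toy example. The sketch below uses purely synthetic data (not from the study): a smooth "image" with a small high-intensity lesion is compared against a blurred reconstruction, in which the lesion is largely erased, and a noisy one, in which it survives. PSNR rewards the blurred version:

```python
import numpy as np

def psnr(ref, img, data_range=1.0):
    """Peak Signal-to-Noise Ratio: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((ref - img) ** 2)
    return 10 * np.log10(data_range ** 2 / mse)

def box_blur(img, k=5):
    """Simple k x k box blur with edge padding."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

rng = np.random.default_rng(0)

# Synthetic "ground truth": smooth background with a tiny 2x2 lesion.
gt = np.full((64, 64), 0.5)
gt[31:33, 31:33] = 1.0

blurred = box_blur(gt, k=5)                  # lesion contrast almost gone
noisy = gt + rng.normal(0, 0.05, gt.shape)   # lesion still clearly visible

print(f"PSNR blurred: {psnr(gt, blurred):.1f} dB")  # higher score
print(f"PSNR noisy:   {psnr(gt, noisy):.1f} dB")    # lower score

# Lesion-to-background contrast, the quantity a reader actually cares about:
contrast = lambda im: im[31:33, 31:33].mean() - 0.5
print(f"contrast blurred: {contrast(blurred):.2f}, noisy: {contrast(noisy):.2f}")
```

Because the background is smooth, blurring costs little in mean squared error even though it destroys the lesion, whereas noise is penalised everywhere despite leaving the lesion intact.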

 

Evidence of FR-IQA Failures Across Modalities 
Across the study, several examples highlight how PSNR, SSIM and even newer metrics like LPIPS (which leverages neural networks for perceptual similarity) fall short. In CT imaging, advanced machine learning-based reconstructions that visually obscure tumours still score highest under PSNR and SSIM. MRI examples show reconstructions with significant blurring rated better than sharper images by FR metrics, misaligning with radiologist preferences. In OCT, reconstructions with poor axial resolution due to incorrect dispersion compensation are judged favourably, despite being diagnostically compromised.

 


 

Similar issues occur in digital pathology and X-ray imaging. In pathology, images scanned with fewer focus points, which appear blurry to the human eye, score better simply due to minor spatial misalignments in higher-quality scans. In X-ray, post-processing that enhances edges or adjusts contrast for clinical visibility is penalised by FR metrics that do not account for the diagnostic goals of the image.

 

Even LPIPS, while more robust to small spatial shifts, fails in several scenarios where clinical priorities like lesion visibility, tissue contrast or structural integrity matter more than superficial perceptual similarity. The study also notes that SSIM implementations vary: different kernels and parameter settings (such as window size) yield inconsistent scores for the same images, further complicating reproducibility.
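That reproducibility point is easy to demonstrate. The sketch below is a minimal uniform-window implementation of the standard SSIM formula (illustrative only, not any particular library's implementation); changing nothing but the window size alters the score for the same image pair:

```python
import numpy as np

def ssim_mean(x, y, win=7, data_range=1.0):
    """Mean SSIM over uniform win x win local windows (standard formula)."""
    C1, C2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2

    def local_mean(img):
        pad = win // 2
        p = np.pad(img, pad, mode="edge")
        out = np.zeros_like(img)
        for dy in range(win):
            for dx in range(win):
                out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        return out / (win * win)

    mx, my = local_mean(x), local_mean(y)
    vx = local_mean(x * x) - mx * mx       # local variance of x
    vy = local_mean(y * y) - my * my       # local variance of y
    cxy = local_mean(x * y) - mx * my      # local covariance
    num = (2 * mx * my + C1) * (2 * cxy + C2)
    den = (mx ** 2 + my ** 2 + C1) * (vx + vy + C2)
    return float(np.mean(num / den))

rng = np.random.default_rng(1)
ref = rng.random((64, 64))                     # synthetic reference image
deg = ref + rng.normal(0, 0.1, ref.shape)      # degraded version

s7 = ssim_mean(ref, deg, win=7)
s11 = ssim_mean(ref, deg, win=11)
print(f"SSIM (7x7 window):   {s7:.4f}")
print(f"SSIM (11x11 window): {s11:.4f}")  # same images, different score
```

Without the window size, kernel type and data range being reported, two papers quoting "SSIM" for the same method need not be comparable.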

 

Toward Task-Informed, Clinically Aligned Evaluation 
The study calls for a paradigm shift. Instead of defaulting to general-purpose FR-IQA metrics, researchers and developers should adopt evaluation frameworks that reflect the actual clinical use case. This means leveraging No Reference (NR) IQA where applicable or developing task-specific FR measures when reference images are available. Key recommendations include:

  1. Establishing datasets with expert ratings to correlate IQA outputs with clinical judgment. 
  2. Publishing detailed implementation parameters, visualisation strategies and preprocessing steps for reproducibility. 
  3. Avoiding the dual use of IQA metrics as both training loss functions and evaluation tools, which introduces bias. 
  4. Including visual inspection by domain experts whenever possible, especially when evaluating methods intended for diagnostic tasks. 

 

The study further outlines guidelines for using FR-IQA responsibly. These include assessing metric suitability for the task, defining relevant regions of interest, standardising image preprocessing and choosing metrics based on the perceptual qualities relevant to the application (e.g., contrast, structure, brightness). Sharing code and making evaluation frameworks public is also encouraged to support transparency and broader validation.
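One of those guidelines, defining relevant regions of interest, is straightforward to operationalise. As an illustrative sketch (the mask and the choice of PSNR here are the author's hypothetical example, not prescribed by the study), a metric can be restricted to a diagnostically relevant region so that large, unaffected background areas do not dilute the score:

```python
import numpy as np

def psnr_roi(ref, img, mask, data_range=1.0):
    """PSNR computed only over pixels where mask is True (the ROI)."""
    mse = np.mean((ref[mask] - img[mask]) ** 2)
    return 10 * np.log10(data_range ** 2 / mse)

rng = np.random.default_rng(2)
ref = rng.random((64, 64))
img = ref.copy()
img[28:36, 28:36] += rng.normal(0, 0.2, (8, 8))  # degrade only a central region
np.clip(img, 0.0, 1.0, out=img)

# Hypothetical ROI around the clinically relevant central region.
mask = np.zeros(ref.shape, dtype=bool)
mask[24:40, 24:40] = True

g = psnr_roi(ref, img, np.ones_like(mask))  # all-True mask = global PSNR
r = psnr_roi(ref, img, mask)
print(f"global PSNR: {g:.1f} dB")
print(f"ROI PSNR:    {r:.1f} dB")  # localised degradation no longer diluted
```

The global score is inflated by the thousands of untouched background pixels; restricting the computation to the ROI exposes the localised degradation.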

 

The continued use of general-purpose FR-IQA metrics in medical imaging threatens the accuracy and clinical relevance of algorithm evaluations. Evidence from real-world examples across CT, MRI, X-ray, OCT, digital pathology and photoacoustic imaging shows that these metrics often misrepresent image quality in ways that could impact patient care if carried into clinical practice. To bridge the gap between algorithm development and clinical deployment, the medical imaging community must move towards evaluation frameworks that reflect diagnostic needs. This requires a collaborative effort between engineers, radiologists and data scientists to define meaningful, reproducible and clinically grounded quality assessments.

 

Source: Journal of Imaging Informatics in Medicine 

 



References:

Breger A, Biguri A, Landman MS et al. (2025) A Study of Why We Need to Reassess Full Reference Image Quality Assessment with Medical Images. Journal of Imaging Informatics in Medicine.


