Accurately classifying prostate cancer (PCa) aggressiveness is essential for effective treatment planning, yet traditional methods such as biopsy carry risks and may lead to sampling errors. Biparametric MRI (bpMRI), a noninvasive imaging technique, has gained prominence in PCa staging. The rise of deep learning (DL) offers potential for automating aggressiveness classification using bpMRI, reducing reliance on invasive procedures. However, model performance can be significantly influenced by external factors, including scanner manufacturer, the use of endorectal coils (ERC) and the presence of clinical variables. Understanding how these elements affect predictive accuracy is vital to ensuring reliable and generalisable AI tools for PCa management.
Influence of Imaging Equipment and Protocols
The scanner manufacturer and scanning protocol significantly influenced DL model performance. The dataset used for model training comprised over 5400 bpMRI studies sourced from multiple European centres and acquired on three major scanner brands: Siemens, Philips and GE. Among GE studies, data were further stratified by ERC use. Models trained on data from a particular manufacturer consistently performed best when tested on data from that same manufacturer, and predictive accuracy fell markedly (a mean AUC drop of 0.05) when models were tested on data from other sources.
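As a minimal sketch of this kind of leave-one-manufacturer-out evaluation, the following trains on one vendor's studies and scores on the others; the features, labels and vendor tags are random placeholders, not the study's data or code:

```python
# Sketch of a leave-one-manufacturer-out experiment. All data below are
# random placeholders: X stands in for per-study model inputs, y for binary
# aggressiveness labels, and 'vendors' tags each study's scanner brand.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
vendors = np.array(["Siemens", "Philips", "GE"])[rng.integers(0, 3, 600)]
X = rng.normal(size=(600, 16))   # placeholder feature vectors
y = rng.integers(0, 2, 600)      # placeholder labels (e.g. ISUP-derived)

for train_v in np.unique(vendors):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[vendors == train_v], y[vendors == train_v])
    for test_v in np.unique(vendors):
        if test_v == train_v:
            continue  # same-vendor scoring would need a held-out split
        auc = roc_auc_score(y[vendors == test_v],
                            model.predict_proba(X[vendors == test_v])[:, 1])
        print(f"train={train_v:8s} test={test_v:8s} AUC={auc:.3f}")
```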
Notably, GE scanners with ERC produced the lowest-performing models, while Philips-based models outperformed even those trained on larger Siemens datasets. The variability in scanner hardware and acquisition protocols altered deep feature distributions extracted by the models, with visual and quantitative analysis confirming that DL features clustered more strongly by manufacturer than by cancer aggressiveness. These results underscore how scanner-specific characteristics heavily influence model learning, especially in the absence of lesion localisation input.
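One way to probe such clustering, sketched below under the assumption of a trained PyTorch model and a dataloader yielding (image batch, vendor tags) pairs, is to project penultimate-layer features with t-SNE and colour them by manufacturer; every name here is illustrative rather than the study's code:

```python
# Sketch: test whether deep features cluster by scanner vendor. Assumes a
# trained PyTorch model and a dataloader yielding (image_batch, vendor_list)
# pairs; all names are illustrative, not the study's actual code.
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def penultimate_features(model, loader, device="cpu"):
    """Collect penultimate-layer activations and the vendor tag per study."""
    trunk = torch.nn.Sequential(*list(model.children())[:-1])  # drop the head
    trunk.eval().to(device)
    feats, vendors = [], []
    for images, vendor_ids in loader:
        feats.append(trunk(images.to(device)).flatten(1).cpu())
        vendors.extend(vendor_ids)
    return torch.cat(feats).numpy(), vendors

# Usage (model and loader assumed to exist):
# feats, vendors = penultimate_features(model, loader)
# emb = TSNE(n_components=2, perplexity=30).fit_transform(feats)
# for v in sorted(set(vendors)):
#     idx = [i for i, x in enumerate(vendors) if x == v]
#     plt.scatter(emb[idx, 0], emb[idx, 1], label=v, s=8)
# plt.legend(); plt.title("Deep features by scanner vendor"); plt.show()
```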
Model Architecture and Data Composition
Model architecture had a direct impact on classification outcomes. Simpler architectures, such as the VGG-based model, consistently outperformed more modern networks like ResNet, ConvNeXt and vision transformers (ViTs). Across both cross-validation and the held-out test set, VGG models achieved higher AUCs, including a 0.016 advantage over the next best-performing model on the test set. ViT models in particular underperformed, likely owing to their data-intensive nature and weaker inductive biases compared with convolutional networks.
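For orientation, the backbone families being compared can be instantiated from torchvision with a binary head swapped in, as in the sketch below; the specific depths, pretraining and head designs used in the study are not stated here and may differ:

```python
# Sketch: the compared backbone families, each with its classifier head
# replaced by a single-logit binary output. The exact variants used in the
# study are an open assumption; these are representative torchvision picks.
import torch.nn as nn
from torchvision import models

def binary_backbone(name: str) -> nn.Module:
    if name == "vgg":
        m = models.vgg16(weights=None)
        m.classifier[-1] = nn.Linear(m.classifier[-1].in_features, 1)
    elif name == "resnet":
        m = models.resnet50(weights=None)
        m.fc = nn.Linear(m.fc.in_features, 1)
    elif name == "convnext":
        m = models.convnext_tiny(weights=None)
        m.classifier[-1] = nn.Linear(m.classifier[-1].in_features, 1)
    elif name == "vit":
        m = models.vit_b_16(weights=None)
        m.heads.head = nn.Linear(m.heads.head.in_features, 1)
    else:
        raise ValueError(name)
    return m
```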
Combining multiple MRI sequences, namely T2-weighted imaging, diffusion-weighted imaging (DWI) and apparent diffusion coefficient (ADC) maps, also improved performance. The inclusion of DWI and ADC alongside T2-weighted images increased AUC by 0.04, reinforcing the importance of multi-sequence input for robust classification. Learning curve analyses further showed that increasing training data generally improved performance, especially when models were tested on data from the same manufacturer. However, performance gains plateaued when ERC data from GE were involved, indicating a ceiling effect tied to protocol quality.
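A common way to realise this kind of fusion, and an assumption here rather than the study's documented pipeline, is to stack co-registered T2w, DWI and ADC slices as input channels:

```python
# Sketch: one common fusion scheme for 2D networks is stacking co-registered
# sequence slices as channels; whether the study used exactly this scheme is
# an assumption.
import numpy as np
import torch

def stack_sequences(t2w: np.ndarray, dwi: np.ndarray, adc: np.ndarray) -> torch.Tensor:
    """Each array is one co-registered slice (H, W); returns a (3, H, W) tensor."""
    def norm(x):
        x = x.astype(np.float32)
        return (x - x.mean()) / (x.std() + 1e-6)  # per-sequence z-score
    return torch.from_numpy(np.stack([norm(t2w), norm(dwi), norm(adc)]))

# Example: stack_sequences(t2w_slice, dwi_slice, adc_slice).unsqueeze(0)
# yields a (1, 3, H, W) batch matching standard 3-channel backbones.
```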
Clinical Variables and Generalisability
The integration of clinical variables—age, prostate-specific antigen (PSA) level and PI-RADS scores—into hybrid models did not yield significant improvements. Both statistical analysis and cross-validation indicated no notable difference in performance between models using only imaging data and those incorporating clinical inputs. Even elastic net-regularised linear classification models, which integrated probability outputs from DL models with clinical features, failed to show statistically significant gains, except under alternative ISUP stratifications (ISUP 1–2 vs 3–5).
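A minimal sketch of such a hybrid, assuming hypothetical feature columns (DL probability, age, PSA, PI-RADS) and random placeholder labels, is an elastic net-penalised logistic regression:

```python
# Sketch: elastic net-regularised logistic regression fusing the DL output
# probability with clinical variables. Feature columns and labels below are
# hypothetical placeholders, not the study's data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400
X = np.column_stack([
    rng.uniform(0, 1, n),        # DL model probability
    rng.normal(65, 8, n),        # age (years)
    rng.lognormal(2.0, 0.5, n),  # PSA (ng/mL)
    rng.integers(3, 6, n),       # PI-RADS score (3-5)
])
y = rng.integers(0, 2, n)        # placeholder labels, e.g. ISUP 1-2 vs 3-5

hybrid = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=1.0, max_iter=5000),
)
hybrid.fit(X, y)
print(hybrid.predict_proba(X[:5])[:, 1])  # fused risk estimates
```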
The inclusion of all manufacturer data during model training provided the most stable results across different testing scenarios, demonstrating improved generalisability. Nevertheless, even these “full” models exhibited scanner-specific feature clustering in deep feature space. This suggests that while comprehensive datasets enhance robustness, achieving true generalisability still requires careful consideration of acquisition diversity. Interestingly, the performance of DL models in this study mirrored human radiologist benchmarks reported in past trials, with sensitivity values exceeding 90% and specificity around 30%, reflecting the ongoing challenge of improving specificity without sacrificing sensitivity.
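Benchmark figures like these typically come from fixing an operating point on the ROC curve; the sketch below, on random placeholder scores, picks the highest threshold that still reaches 90% sensitivity and reads off the resulting specificity:

```python
# Sketch: fixing an operating point with at least 90% sensitivity and
# reading off the resulting specificity (scores below are random
# placeholders, not the study's predictions).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, 1000)
y_prob = np.clip(0.25 * y_true + rng.uniform(0, 1, 1000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
i = np.argmax(tpr >= 0.90)  # first (highest) threshold reaching 90% sensitivity
print(f"threshold={thresholds[i]:.3f}  "
      f"sensitivity={tpr[i]:.2f}  specificity={1 - fpr[i]:.2f}")
```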
The study highlighted the critical influence of scanner manufacturer, scanning protocol and architectural choices on the performance of DL models for classifying PCa aggressiveness using bpMRI. The findings emphasise the importance of diverse and well-balanced datasets in training DL models, particularly when aiming for deployment across multiple clinical settings. Simpler convolutional models and multiparametric MRI sequences proved most effective, while the inclusion of clinical variables offered limited value. Although the study demonstrated promising results for automated PCa assessment, future work should explore the addition of lesion-level annotations, molecular data and prospective validation in more demographically diverse populations to further enhance model fairness and clinical utility.
Source: Radiology: Artificial Intelligence
Image Credit: Freepik