Clear cell renal cell carcinoma (ccRCC) represents the most common subtype of kidney cancer and remains a clinical challenge due to its variable progression and risk of recurrence. Despite early detection and treatment through surgery or ablation, up to one-third of cases may relapse, necessitating accurate risk assessment prior to treatment. The SSIGN score, incorporating tumour stage, size, grade and necrosis, is an established postoperative tool to stratify patient prognosis. However, its reliance on pathological data limits its utility in preoperative planning. To address this gap, researchers have developed an interpretable CT-based vision transformer (ViT) model capable of predicting the SSIGN score and clinical outcomes in ccRCC patients using preoperative imaging alone.
Model Development Using Vision Transformers
A retrospective multicentre study analysed 845 patients with pathologically confirmed ccRCC who underwent contrast-enhanced CT scans within 15 days before surgery. The scans were acquired during two key phases: the corticomedullary phase (CMP) and the renal parenchymal phase (RPP). From these images, three ViT-based models were developed: the CMP ViT model (CVM), the RPP ViT model (RVM) and a combined model (CRVM) incorporating both phases.
For each patient, 768 features were extracted from each of the CMP and RPP images using a pre-trained ViT architecture. These features were reduced using minimum redundancy maximum relevance (mRMR) and LASSO regression, resulting in a refined set of 17 features for CVM, 16 for RVM and 19 for CRVM. Logistic regression was used to train the models, and performance was evaluated using the area under the receiver operating characteristic curve (AUC), along with accuracy, sensitivity and specificity.
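To make the evaluation metrics named above concrete, the following is a minimal sketch of how AUC, accuracy, sensitivity and specificity can be computed from predicted probabilities. The labels, scores and the 0.5 cut-off are invented toy values for illustration only and are not data or settings from the study; the function names are hypothetical.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_metrics(labels, scores, threshold=0.5):
    """Accuracy, sensitivity and specificity at a probability cut-off."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }


# Toy example (assumed values): 1 = higher-risk class, 0 = lower-risk class
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]

auc = auc_score(labels, scores)              # 15 of 16 pairs ranked correctly
metrics = confusion_metrics(labels, scores)  # 3 TP, 1 FN, 1 FP, 3 TN
print(auc, metrics)
```

AUC summarises ranking quality across all cut-offs, whereas accuracy, sensitivity and specificity depend on the chosen threshold, which is why studies of this kind typically report them together.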
Among the models, CRVM showed the best overall performance with an AUC of 0.895 in both the training and test cohorts. It also offered higher accuracy and clinical utility, as demonstrated by decision curve analysis. By combining the strengths of both image phases, CRVM could capture a more comprehensive view of tumour characteristics and heterogeneity, leading to better risk stratification than either single-phase model.
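Decision curve analysis, mentioned above, compares models by their net benefit at a chosen threshold probability. The sketch below uses the standard net-benefit formula, TP/N minus FP/N weighted by the odds of the threshold, on invented toy data; it is an illustration of the general technique, not the study's analysis, and the function names are hypothetical.

```python
def net_benefit(labels, scores, threshold):
    """Net benefit of acting on patients whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)  # harm-to-benefit weighting
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient as high risk."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(labels) / len(labels)
    odds = threshold / (1.0 - threshold)
    return prevalence - (1.0 - prevalence) * odds


# Toy example (assumed values, not study data)
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]

nb_model = net_benefit(labels, scores, 0.5)
nb_all = net_benefit_treat_all(labels, 0.5)
print(nb_model, nb_all)
```

A model is considered clinically useful over the range of thresholds where its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero by definition.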
Interpretability and Prognostic Value
A major advance of this study lies in the interpretability of the CRVM model. To provide transparency and build clinical trust, the researchers employed SHAP (SHapley Additive exPlanations), a method that quantifies the contribution of each feature to the model’s prediction. The SHAP summary plots revealed which ViT-derived features most influenced the model’s output, while SHAP waterfall plots demonstrated how these features drove risk classification in individual patients. This level of insight is critical for integrating AI tools into clinical workflows, where understanding the rationale behind predictions is often as important as their accuracy.
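The idea behind SHAP can be illustrated with an exact Shapley-value calculation on a tiny model. The sketch below enumerates every feature coalition for a hypothetical three-feature logistic model working in log-odds space, replacing "absent" features with a baseline value. The weights, patient values and baseline are invented, and this brute-force approach is a teaching aid rather than the SHAP library or the study's implementation.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values: features outside a coalition take baseline values."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical logistic model in log-odds space (assumed coefficients)
weights = [1.5, -2.0, 0.5]
bias = -0.25


def log_odds(z):
    return bias + sum(w * v for w, v in zip(weights, z))


patient = [1.0, 0.5, 2.0]   # one patient's (invented) feature values
baseline = [0.0, 1.0, 1.0]  # reference values, e.g. cohort means

phi = shapley_values(log_odds, patient, baseline)
print(phi)  # approximately [1.5, 1.0, 0.5]
```

The contributions sum to the difference between the patient's log-odds and the baseline log-odds, which is the property a waterfall plot visualises: starting from the baseline prediction, each feature pushes the output up or down until the patient's prediction is reached.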
Beyond predicting SSIGN scores, the CRVM was also assessed for its ability to predict progression-free survival (PFS). Kaplan–Meier analysis showed that the model separated low-risk from intermediate-to-high-risk groups, and the CRVM achieved a Harrell’s concordance index (C-index) of 0.840 in the test cohort. This exceeded the C-index of both CVM and RVM, reinforcing CRVM’s potential for identifying patients who may require more intensive postoperative monitoring or adjuvant therapy.
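Harrell's C-index, reported above, measures how often the patient with the higher predicted risk is also the one who progresses first. A minimal sketch on invented follow-up data is shown below; the times, event flags and risk scores are assumptions for illustration, and pairs with tied follow-up times are simply skipped in this simplified version.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    (events[i] == 1) strictly before patient j's follow-up time.
    Tied risk scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # censored patients cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (assumed values): follow-up in months, 1 = progression, 0 = censored
times = [5, 10, 12, 20, 25]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.6, 0.3, 0.1]

c_index = harrell_c_index(times, events, risk)  # 7 of 8 comparable pairs concordant
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported 0.840 indicates that the model ordered most comparable patient pairs correctly.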
Clinical Implications and Future Directions
The application of ViT models in radiology is a relatively new development, but their ability to analyse global context in imaging has shown promise in several cancers. In this study, the CRVM demonstrated a robust capability to non-invasively predict both the SSIGN score and progression-free survival in ccRCC. This may support clinicians in tailoring surgical and therapeutic strategies prior to biopsy or operation, improving patient care through precision medicine.
However, the study is not without limitations. As a retrospective analysis involving multiple centres and varied CT scanners, the findings may be affected by differences in imaging protocols. Standardisation efforts, including image normalisation, were applied, yet prospective validation remains necessary. Furthermore, the current model focused solely on imaging-derived features and did not incorporate clinical, biochemical or demographic data, which may offer additional predictive value. Combining ViT features with clinical variables or exploring hybrid architectures that integrate convolutional neural networks (CNNs) could further enhance model performance and generalisability.
Another limitation is the exclusion of traditional radiomics features, which may provide complementary information to ViT outputs. Future research should investigate whether integrating radiomics, CNNs and ViTs in a unified framework could produce a more comprehensive and accurate model for clinical deployment.
The study highlighted the value of an interpretable, CT-based vision transformer model for preoperative prediction of the SSIGN score and clinical outcome in clear cell renal cell carcinoma. The CRVM outperformed single-phase models and provided valuable prognostic information, with high accuracy and strong interpretability via SHAP. These findings suggest that ViT-based tools can support risk stratification and personalised management in renal cancer, marking a step forward in non-invasive cancer diagnostics and AI-assisted radiology. Further prospective and integrative research will help translate this promising technology into routine clinical practice.
Source: Insights into Imaging