Fragmented clinical data remain a challenge in the management of bladder cancer, renal cell carcinoma and prostate cancer, where imaging, pathology, genomic profiling and laboratory variables are often interpreted in isolation. UroFusion-X is a multimodal deep learning framework designed to integrate these heterogeneous data sources within a unified architecture. The system supports diagnostic classification, molecular subtyping and survival prediction while maintaining functionality when certain modalities are unavailable. Its design incorporates modality-specific encoders and cross-modal fusion strategies intended to align representations across data types and reduce performance degradation under incomplete inputs. Evaluation is reported across multi-centre cohorts with internal and external validation, including testing under institutional hold-out scenarios. Comparative analyses are conducted against unimodal models, standard multimodal baselines and selected clinical scoring systems. Reported outcomes cover diagnostic discrimination, subtyping performance, survival stratification, robustness to modality dropout and measures of clinical net benefit.
Integrated Architecture for Heterogeneous Clinical Data
The framework combines dedicated encoders for imaging, digital pathology, omics data and structured clinical variables. Imaging inputs from modalities such as CT, MRI and ultrasound are processed using a transformer-based model. Whole-slide pathology images are handled through a multiple-instance learning approach. Omics data are modelled with a graph-based structure reflecting pathway relationships, while laboratory and clinical features are encoded using a transformer architecture designed for tabular data. These modality-specific representations are brought together through a two-stage fusion strategy that supports cross-modal interaction and adaptive weighting.
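The encoder-per-modality design described above can be sketched as a registry of encoders that all project into a shared embedding width. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, seeds and linear stand-in encoders are invented here, and the real encoders (imaging transformer, whole-slide multiple-instance model, pathway graph network, tabular transformer) are far richer. Only the shared-width interface matters for the sketch.

```python
import numpy as np

EMBED_DIM = 8  # shared embedding width (illustrative only)

def make_linear_encoder(in_dim, seed):
    # Stand-in for a modality-specific encoder: a fixed linear map into the
    # shared embedding space, squashed with tanh for boundedness.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((in_dim, EMBED_DIM)) / np.sqrt(in_dim)
    return lambda x: np.tanh(x @ W)

encoders = {
    "imaging":   make_linear_encoder(32, seed=1),   # CT / MRI / US features
    "pathology": make_linear_encoder(64, seed=2),   # whole-slide bag summary
    "omics":     make_linear_encoder(128, seed=3),  # pathway-graph readout
    "clinical":  make_linear_encoder(16, seed=4),   # labs and tabular variables
}

def encode_available(inputs):
    # Encode only the modalities supplied for this patient; downstream
    # fusion handles whichever subset is present.
    return {m: encoders[m](x) for m, x in inputs.items()}

# Patient with omics unavailable: the pipeline still yields embeddings.
patient = {"imaging": np.ones(32), "pathology": np.ones(64), "clinical": np.ones(16)}
embeddings = encode_available(patient)
print({m: e.shape for m, e in embeddings.items()})
```

Because every encoder emits the same width, the fusion stage never needs to know which subset of modalities produced its inputs.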
A co-attention mechanism enables token-level exchange of information between modalities, promoting alignment of complementary features. Fusion is implemented using a gated product-of-experts approach, which assigns weights based on modality availability and learned relevance. Modality presence masks allow the system to operate without architectural modification when inputs are missing. The training process incorporates modality dropout so that the model is exposed to incomplete combinations during optimisation, reducing reliance on any single data source. Additional constraints are introduced to strengthen correspondence between radiological regions and pathology attention maps, supporting interpretability while preserving predictive performance. Prognostic modelling is implemented using survival prediction heads designed to estimate risk distributions across the three cancer types.
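The gated weighting with presence masks and the training-time modality dropout can be illustrated with a small sketch. This is an assumption-laden simplification, not the published architecture: a softmax gate over per-modality logits stands in for the learned gated product-of-experts, and all names, dimensions and values are invented for illustration.

```python
import numpy as np

def gated_fusion(embeddings, gate_logits, present):
    # Softmax gate restricted to modalities marked present: a missing input
    # receives weight exactly zero, with no architectural change required.
    mods = sorted(embeddings)
    logits = np.array([gate_logits[m] if present[m] else -np.inf for m in mods])
    w = np.exp(logits - logits[np.isfinite(logits)].max())
    w /= w.sum()
    fused = sum(wi * embeddings[m] for wi, m in zip(w, mods))
    return fused, dict(zip(mods, w))

def modality_dropout(present, p_drop, rng):
    # Training-time modality dropout: randomly mask available modalities so
    # the model is optimised on incomplete combinations, keeping at least one.
    kept = {m: avail and rng.random() >= p_drop for m, avail in present.items()}
    if not any(kept.values()):
        survivors = [m for m, avail in present.items() if avail]
        kept[rng.choice(survivors)] = True
    return kept

dim = 4
emb = {m: np.full(dim, v) for m, v in
       [("imaging", 1.0), ("pathology", 2.0), ("omics", 3.0)]}
logits = {"imaging": 0.5, "pathology": 1.0, "omics": 0.2}
present = {"imaging": True, "pathology": True, "omics": False}  # omics missing

fused, weights = gated_fusion(emb, logits, present)
print(weights["omics"], round(sum(weights.values()), 6))
```

The presence mask makes the missing-modality case a weighting decision rather than a structural one, which is why inference can proceed without architectural modification.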
Performance Across Diagnosis, Subtyping and Prognosis
Across bladder, renal and prostate cancers, the framework achieves high diagnostic discrimination, with AUROC values approaching or exceeding 0.9 in two tumour types and remaining strong in the third, outperforming imaging-only models and conventional multimodal fusion approaches. For molecular subtyping in bladder and renal cancers, F1-scores fall in the mid-to-high 0.8 range, exceeding pathology-only and simpler multimodal baselines. These results suggest that integrating genomic and histopathological signals enhances subtype differentiation.
Cross-centre validation indicates limited performance degradation when individual institutions are excluded during training. Reported AUROC declines are generally modest, although larger reductions are observed in centres with lower availability of genomic data or heterogeneous pathology coverage. A multi-task formulation is associated with reduced variance under institutional shifts compared with single-task configurations, supporting stability across diverse clinical environments.
In survival modelling, the system stratifies patients into distinct risk groups with early separation of survival curves. C-index values fall in the low-to-mid 0.7 range across the three cancers, exceeding the low-0.6-range values reported for selected clinical scoring systems. Differences between high-risk and low-risk groups are reflected in reported survival durations and statistically significant log-rank tests. These findings position the framework as a multimodal tool supporting risk assessment alongside diagnosis and subtyping.
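The C-index reported above is a standard measure of how well predicted risks order patients by observed survival. A minimal pure-Python implementation of Harrell's concordance index, on an invented toy cohort, makes the metric concrete; the O(n²) pair loop is for clarity, not efficiency.

```python
def concordance_index(times, events, risks):
    # Harrell's C-index: over comparable pairs (the earlier time must be an
    # observed event, not a censoring), count how often the patient who
    # failed earlier carries the higher predicted risk; risk ties score 0.5.
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Perfectly ordered risks on a toy cohort (third patient censored).
times, events, risks = [5, 8, 12], [1, 1, 0], [0.9, 0.6, 0.1]
print(concordance_index(times, events, risks))  # 1.0
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the low-to-mid 0.7 range reported for the framework indicates substantially better-than-chance risk ranking.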
Robustness, Ablation and Clinical Utility
Robustness analyses assess the effect of removing individual modalities at inference. Performance declines are limited, generally within a few percentage points of AUROC when imaging, pathology or genomic inputs are omitted. With the gated fusion mechanism in place, average degradation under missing-modality scenarios remains relatively small, whereas replacement with simple concatenation produces more pronounced declines. In specific scenarios such as absent imaging, performance collapse is reported when adaptive fusion is not used.
Ablation experiments highlight the contribution of individual components. Removal of cross-modal attention results in measurable reductions in diagnostic and subtyping metrics. Excluding modality dropout during training leads to larger performance drops when inputs are missing at inference. The anatomy–pathology consistency constraint primarily enhances spatial correspondence between modalities, increasing alignment metrics substantially while exerting only a modest influence on overall discrimination performance.
Clinical utility is evaluated through decision curve analysis across a wide range of threshold probabilities. At a representative mid-range threshold, the framework demonstrates higher net benefit and lower false-positive rates compared with selected clinical scoring systems. Limitations are acknowledged, including reliance on retrospective datasets, sensitivity under extreme data sparsity and computational demands, with training requiring high-end hardware over extended periods and multimodal inference taking several minutes per patient.
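Net benefit, the quantity compared in a decision curve analysis, can be computed with a few lines of Python. This sketch uses the standard net-benefit formula on invented toy data, not the paper's evaluation code; the threshold and probabilities are illustrative.

```python
def net_benefit(y_true, y_prob, threshold):
    # Net benefit at threshold p_t: (TP/N) - (FP/N) * p_t / (1 - p_t).
    # True positives are credited in full; false positives are debited at
    # the harm-to-benefit odds implied by the chosen threshold probability.
    n = len(y_true)
    flagged = [p >= threshold for p in y_prob]
    tp = sum(1 for f, y in zip(flagged, y_true) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, y_true) if f and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.1]
print(net_benefit(y_true, y_prob, 0.5))     # model at p_t = 0.5 -> 0.25
print(net_benefit(y_true, [1.0] * 4, 0.5))  # treat-all reference -> 0.0
```

Sweeping the threshold over a clinically plausible range and plotting net benefit against the treat-all and treat-none references yields the decision curve used to judge whether a model adds value at a given decision threshold.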
UroFusion-X presents a multimodal framework integrating imaging, pathology, omics and clinical data for bladder, renal and prostate cancers. Reported results indicate strong diagnostic discrimination, competitive subtyping accuracy and improved survival stratification relative to selected baselines, alongside resilience to missing modalities. Architectural features such as cross-modal attention, adaptive fusion and modality dropout contribute to robustness and interpretability. While limitations related to retrospective design, data sparsity and computational requirements remain, the framework illustrates how coordinated multimodal learning can support more integrated decision-making across diverse urological cancer workflows.
Source: npj digital medicine