Whole slide images in digital pathology capture detailed tissue morphology, but their gigapixel scale creates computational and annotation barriers for routine automation. A multimodal approach that combines these images with the information contained in pathology reports promises gains in accuracy and consistency without exhaustive pixel-level labels. MPath-Net, an end-to-end framework that fuses image features from multiple-instance learning with text embeddings from transformer models, targets tumour subtype classification across kidney and lung cancers using publicly available data. Reported improvements over established baselines underscore the value of integrating visual and textual signals for decision support and workflow efficiency in cancer diagnostics.
Dataset and Clinical Task
The framework is evaluated on a curated selection from The Cancer Genome Atlas encompassing 1684 cases, with 916 kidney and 768 lung records, spanning five disease types: Kidney Renal Clear Cell Carcinoma, Kidney Renal Papillary Cell Carcinoma, Kidney Chromophobe, Lung Adenocarcinoma and Lung Squamous Cell Carcinoma. Patient-level splits allocate 70% for training, 10% for validation and 20% for testing, fixed across experiments to ensure consistent comparisons. Only pathology reports that were clear and readable in PDF format were retained, and optical character recognition was used to create a clean, machine-readable corpus. The task is multi-class subtype classification, aligning with clinical aims such as tailoring therapies, guiding monitoring and reducing variability in interpretation.
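For readers who want to reproduce the partitioning, a minimal sketch of a patient-level, subtype-stratified 70/10/20 split follows. The paper's exact tooling, stratification details and random seed are not stated, so the scikit-learn usage and placeholder names here are assumptions.

```python
# Minimal sketch of a patient-level 70/10/20 split stratified by subtype.
# Case IDs, labels and the fixed seed are illustrative placeholders.
from sklearn.model_selection import train_test_split

def patient_level_split(case_ids, subtypes, seed=42):
    """Split case IDs so every patient appears in exactly one partition."""
    train_ids, hold_ids, _, hold_y = train_test_split(
        case_ids, subtypes, test_size=0.30, stratify=subtypes, random_state=seed
    )
    # Split the 30% hold-out into 10% validation and 20% test of the total.
    val_ids, test_ids, _, _ = train_test_split(
        hold_ids, hold_y, test_size=2 / 3, stratify=hold_y, random_state=seed
    )
    return train_ids, val_ids, test_ids
```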
WSIs are preprocessed into tissue-containing patches and filtered to remove background using an edge-based criterion. Multiple-instance learning treats each slide as a bag of instances, enabling weak supervision when patch-level labels are unavailable. In this context, the slide label supervises learning while the model reasons over patch embeddings to infer slide-level subtype. This setting mirrors clinical practice where image patterns and narrative reports are considered together to delineate tumour characteristics and support subtype assignment.
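The exact edge-based criterion is not detailed here; the sketch below shows one plausible implementation that keeps a patch when the fraction of edge pixels exceeds a threshold, with the detector and threshold values chosen as assumptions rather than taken from the paper.

```python
# Illustrative edge-based tissue filter; Canny thresholds and the keep
# fraction are assumptions, not the authors' published settings.
import cv2
import numpy as np

def keep_patch(patch_rgb: np.ndarray, edge_fraction_threshold: float = 0.02) -> bool:
    """Keep a patch if enough of its pixels lie on detected edges (tissue texture)."""
    gray = cv2.cvtColor(patch_rgb, cv2.COLOR_RGB2GRAY)   # expects uint8 RGB input
    edges = cv2.Canny(gray, threshold1=50, threshold2=150)
    edge_fraction = float((edges > 0).mean())
    return edge_fraction >= edge_fraction_threshold
```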
Model Design and Training
MPath-Net adopts a feature-level fusion strategy. On the imaging pathway, a Dual-Stream Multiple Instance Learning architecture with a ResNet-18 backbone produces a 512-dimensional slide representation from tissue patches. The final classification layer of the MIL network is removed, and embeddings from the aggregator are used as image features. On the text pathway, pathology reports are embedded with transformer models. A Sentence-BERT encoder delivers 768-dimensional representations that are projected via a trainable multilayer perceptron to 512 dimensions to align with the image feature space. The concatenated 1024-dimensional vector passes through fully connected layers with normalisation, rectified linear activations and dropout before a softmax output over five classes. During end-to-end training, the image encoder, fusion layers and classifier are optimised while the text encoder remains frozen to preserve its domain representations.
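A minimal PyTorch sketch of this fusion head is shown below. The MIL aggregator and Sentence-BERT encoder are treated as black boxes that already emit 512- and 768-dimensional vectors; the hidden sizes, choice of layer normalisation and dropout rate are assumptions rather than the authors' exact configuration.

```python
# Sketch of the fusion head described above; layer sizes, LayerNorm and
# dropout probability are assumptions, not the published architecture.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, proj_dim=512, n_classes=5, p_drop=0.25):
        super().__init__()
        # Trainable MLP projecting text embeddings into the 512-d image space.
        self.text_proj = nn.Sequential(
            nn.Linear(txt_dim, proj_dim),
            nn.ReLU(),
        )
        # Fusion layers over the concatenated 1024-d vector.
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + proj_dim, 512),
            nn.LayerNorm(512),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(512, n_classes),  # logits; softmax is applied in the loss
        )

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([img_feat, self.text_proj(txt_feat)], dim=-1)
        return self.classifier(fused)
```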
Training uses Adam with a learning rate of 0.0001 and a batch size of 1 on A100 GPUs. Evaluation adopts accuracy, precision, recall, F1-score and area under the receiver operating characteristic curve, with 95% confidence intervals obtained by bootstrap resampling. To isolate the effect of fusion and architecture, all unimodal MIL baselines are trained with the same SimCLR-pretrained ResNet-18 backbone used in MPath-Net. Text encoders explored include BioBERT, ClinicalBERT, Clinical BioBERT, PathologyBERT and Sentence-BERT; the Sentence-BERT variant is selected for the primary model based on performance.
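The bootstrap confidence intervals can be sketched as a simple percentile bootstrap over test cases; the number of resamples and the metric shown (macro F1) below are illustrative assumptions.

```python
# Percentile bootstrap 95% CI for a slide-level metric; resample count and
# the macro-F1 metric are assumptions for illustration only.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_ci(y_true, y_pred, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample cases with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return float(lo), float(hi)
```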
Performance and Interpretability
Across the five-class task, the Sentence-BERT-based variant achieves accuracy of 0.9487, precision of 0.9495, recall of 0.9445, F1-score of 0.9460 and AUC of 0.9902. These results exceed those of MIL baselines including TransMIL, ACMIL, ABMIL, MaxMIL and MeanMIL, with p-values consistently below 0.05 for accuracy, precision, recall and F1-score. Class-wise comparisons indicate leading performance in three of five classes. The multimodal model also surpasses unimodal image-only and text-only configurations across the main threshold-based metrics, while a comparison notes that the image-only variant can attain a slightly higher AUC, reflecting a trade-off between ranking capacity and thresholded classification when textual features are incorporated.
Interpretability is addressed through attention heatmaps that highlight regions of interest at patch level on whole slides. Normalised attention scores mapped to colours help localise tumour-suggestive tissue, offering a visual explanation aligned with pathologist reviews. Illustrations compare attention patterns from baseline MIL models with the multimodal output, showing concentrated regions where tumour evidence is strongest. This transparency supports potential clinical utility by aligning model focus with tissue areas that drive classification decisions.
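A simple way to produce such a visualisation is to min-max-normalise the attention weights and paint each patch's colour-mapped score onto a slide thumbnail; the sketch below illustrates this, with the colour map and coordinate handling as assumptions rather than the authors' code.

```python
# Illustrative rendering of patch-level attention as a heatmap canvas;
# the "jet" colour map and pixel bookkeeping are assumptions.
import numpy as np
from matplotlib import colormaps

def attention_heatmap(attn_scores, patch_coords, thumb_shape, patch_px):
    """Paint min-max-normalised attention into a thumbnail-sized RGB canvas."""
    cmap = colormaps["jet"]
    attn = np.asarray(attn_scores, dtype=float)
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)  # scale to [0, 1]
    canvas = np.zeros(tuple(thumb_shape) + (3,), dtype=float)
    for score, (x, y) in zip(attn, patch_coords):
        rgba = cmap(score)  # map normalised score to a colour
        canvas[y:y + patch_px, x:x + patch_px] = rgba[:3]
    return canvas
```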
An end-to-end multimodal framework that fuses weakly supervised whole-slide embeddings with transformer-based report features improves tumour subtype classification for kidney and lung cancers within a large public cohort. By maintaining a shared representation space for images and text and jointly training the visual pathway and fusion layers, MPath-Net advances accuracy and consistency over established MIL baselines and unimodal counterparts. With statistically significant gains across core metrics and attention-based visualisations that expose the slide regions underpinning predictions, the approach illustrates how integrating complementary data streams can support pathologists, reduce inter-reader variability and align with precision oncology goals.
Source: Journal of Healthcare Informatics Research