Respiratory diseases place a major burden on health systems, especially among adult and elderly populations, where symptoms such as cough, sputum production and shortness of breath are often non-specific. AI-based cough analysis has emerged as a non-invasive approach to support early screening, but real-world use remains limited by device variability and incomplete clinical context. A multimodal deep learning framework reported in npj Digital Medicine combines cough acoustics, demographic information and symptom descriptions to classify respiratory diseases. By incorporating adversarial training and invariant risk minimisation, it aims to improve robustness and generalisation across different recording devices and clinical settings.
Multimodal Framework and Data Design
The framework combines cough audio, demographic variables and symptom descriptions to capture complementary diagnostic signals. Evaluation is conducted on a multi-centre cohort of 12,378 adult outpatients recruited from four clinical centres, with data collected using a range of recording devices. Each participant provides cough recordings of at least 10 seconds, processed through a structured quality control pipeline that performs segmentation, event detection and validation to ensure reliability. Standardised three-second segments are extracted around cough onset, covering one second before and two seconds after the cough burst, while device metadata is retained for each segment to enable assessment of acquisition variability.
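The three-second windowing described above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: the function name, the 16 kHz sample rate, the zero-padding at recording edges and the `device_id` value are all assumptions made for the example, and the cough onset time is taken as already known from an upstream event detector.

```python
import numpy as np

def extract_cough_segment(audio, sample_rate, onset_s, pre_s=1.0, post_s=2.0):
    """Return a fixed-length window around a cough onset.

    The window spans pre_s seconds before and post_s seconds after the onset.
    Samples that fall outside the recording are zero-padded (an assumption;
    the source does not say how edge cases are handled).
    """
    n_pre = int(round(pre_s * sample_rate))
    n_post = int(round(post_s * sample_rate))
    onset = int(round(onset_s * sample_rate))
    start, stop = onset - n_pre, onset + n_post

    segment = np.zeros(n_pre + n_post, dtype=audio.dtype)
    src_start, src_stop = max(start, 0), min(stop, len(audio))
    if src_stop > src_start:
        dst_start = src_start - start
        segment[dst_start:dst_start + (src_stop - src_start)] = audio[src_start:src_stop]
    return segment

# Hypothetical usage: a 10-second recording at an assumed 16 kHz sample rate.
sample_rate = 16_000
recording = np.ones(10 * sample_rate, dtype=np.float32)
segment = extract_cough_segment(recording, sample_rate, onset_s=0.5)

# Device metadata travels with each segment so acquisition effects can be studied later.
record = {"audio": segment, "device_id": "phone_A"}  # "phone_A" is a made-up label
```

In this example the onset sits only 0.5 seconds into the recording, so the first half second of the window is padding and the remaining 2.5 seconds are real signal.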
Structured questionnaires provide demographic characteristics, smoking history and detailed symptom profiles, including cough frequency, duration, sputum characteristics and dyspnoea. Disease labels are assigned based on clinical diagnoses and verified through independent review, ensuring consistency across categories. The dataset supports both binary and multi-label classification tasks across seven respiratory diseases, allowing simultaneous identification of coexisting conditions. Data partitioning ensures that recordings from the same device and clinical centre remain within a single fold, preventing information leakage during evaluation. Class imbalance is present across tasks, with lower prevalence observed for certain conditions, particularly pulmonary shadows. This dataset design reflects real-world clinical complexity and supports robust evaluation across heterogeneous populations, devices and disease distributions.
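A grouped split of the kind described above might look like the following sketch. It assumes that the grouping key is the (centre, device) pair and that groups are assigned greedily to balance fold sizes; the source states only that recordings from the same device and centre stay in one fold, so the field names, the key and the balancing rule are illustrative choices.

```python
from collections import defaultdict

def grouped_folds(records, n_folds):
    """Assign every (centre, device) group to exactly one fold.

    Groups are placed largest first into whichever fold is currently smallest,
    which keeps fold sizes roughly even without ever splitting a group.
    """
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[(rec["centre"], rec["device"])].append(idx)

    folds = [[] for _ in range(n_folds)]
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[key])
    return folds

# Hypothetical records; the centre and device labels are invented for the example.
records = (
    [{"centre": "A", "device": "phone_1"}] * 3
    + [{"centre": "A", "device": "phone_2"}] * 2
    + [{"centre": "B", "device": "phone_1"}] * 2
    + [{"centre": "C", "device": "tablet_1"}]
)
folds = grouped_folds(records, n_folds=2)
```

Because no group is ever split, a model evaluated on one fold has not seen any recording from that fold's device and centre combinations during training, which is the leakage the partitioning is meant to prevent.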
Model Architecture and Device Robustness
The architecture integrates an audio encoder and a text encoder within a multimodal fusion framework. Cough recordings are transformed into mel-spectrogram representations and processed through a transformer-based encoder that captures temporal and spectral features associated with disease patterns. Demographic and symptom data are encoded into contextual embeddings using a language model and aligned with audio features through cross-modal attention mechanisms. This interaction enables the model to learn relationships between acoustic signals and clinical attributes, producing a unified representation for classification across multiple disease categories.
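The cross-modal attention step can be illustrated with a single-head scaled dot-product sketch. Everything concrete here is an assumption for illustration: the embedding sizes, the random projection matrices, the choice to let text tokens query audio frames and the mean pooling at the end. The source does not specify the direction of attention, the number of heads or the fusion details.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from one modality onto another."""
    q = queries @ w_q          # (n_query, d)
    k = keys_values @ w_k      # (n_key, d)
    v = keys_values @ w_v      # (n_key, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each query distributes attention over keys
    return weights @ v, weights

rng = np.random.default_rng(0)
audio_frames = rng.normal(size=(50, 64))  # stand-in for encoded mel-spectrogram frames
text_tokens = rng.normal(size=(12, 32))   # stand-in for symptom/demographic token embeddings

d = 16
w_q = rng.normal(size=(32, d)) / np.sqrt(32)
w_k = rng.normal(size=(64, d)) / np.sqrt(64)
w_v = rng.normal(size=(64, d)) / np.sqrt(64)

# Text tokens attend to audio frames, then are pooled into one fused vector.
attended, weights = cross_attention(text_tokens, audio_frames, w_q, w_k, w_v)
fused = attended.mean(axis=0)
```

Each row of `weights` shows which audio frames a given clinical token draws on, which is one way such a model could relate a reported symptom to specific parts of the cough signal.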
Device variability is addressed through an adversarial training mechanism embedded in the audio encoder. A gradient reversal layer connects audio features to a device classifier, encouraging the model to suppress device-specific information while preserving clinically relevant patterns. Invariant risk minimisation is incorporated into the loss function to enforce consistency across device environments and reduce sensitivity to distributional shifts. The optimisation framework integrates device invariance with disease classification, contrastive alignment between modalities and uncertainty-based weighting of loss components. Ablation analyses demonstrate that audio features provide strong baseline performance, while demographic and symptom data contribute complementary information. Full multimodal integration combined with adversarial training yields the highest performance and improved generalisability across heterogeneous acquisition conditions.
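The three ingredients named above can be sketched independently of any deep learning library. The code below is a hedged illustration, not the authors' loss: the gradient reversal class only mimics the forward and backward behaviour an autodiff framework would provide, the penalty follows the commonly used IRMv1 form for binary cross-entropy, and the uncertainty weighting uses the common form that sums exp(-s_i) * L_i + s_i over learnable log-variances s_i. The source does not give the exact formulas, so all function names and forms here are assumptions.

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass; flips and scales the gradient on the backward pass."""
    def __init__(self, lam=1.0):
        self.lam = lam
    def forward(self, x):
        return x
    def backward(self, grad_output):
        # The encoder receives the negated device-classifier gradient,
        # pushing its features to become less predictive of the device.
        return -self.lam * grad_output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(logits, labels):
    """Mean binary cross-entropy computed from logits (stable via logaddexp)."""
    return float(np.mean(np.logaddexp(0.0, logits) - labels * logits))

def irm_penalty(logits, labels):
    """IRMv1-style penalty: squared gradient of the risk w.r.t. a dummy scale w at w = 1."""
    grad_w = np.mean((sigmoid(logits) - labels) * logits)
    return float(grad_w ** 2)

def irm_objective(env_batches, penalty_weight):
    """Average risk over device environments plus the averaged invariance penalty."""
    risks = [bce(z, y) for z, y in env_batches]
    penalties = [irm_penalty(z, y) for z, y in env_batches]
    return float(np.mean(risks) + penalty_weight * np.mean(penalties))

def uncertainty_weighted(losses, log_vars):
    """Combine loss terms with learnable log-variances: sum of exp(-s_i) * L_i + s_i."""
    losses = np.asarray(losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * losses + log_vars))

# Hypothetical logits and labels from two recording devices treated as environments.
env_a = (np.array([2.0, -1.0, 0.5]), np.array([1.0, 0.0, 1.0]))
env_b = (np.array([0.3, -2.0, 1.5]), np.array([0.0, 0.0, 1.0]))
disease_loss = irm_objective([env_a, env_b], penalty_weight=10.0)
total = uncertainty_weighted([disease_loss, 0.7, 0.4], log_vars=[0.0, 0.0, 0.0])
```

In the last line, 0.7 and 0.4 are placeholder values standing in for the device-adversarial and contrastive alignment losses. With all log-variances at zero the combination reduces to a plain sum; during training those values would be learned so that noisier objectives are down-weighted.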
Diagnostic Performance and Clinical Relevance
The framework demonstrates strong performance across multiple classification tasks. For chronic obstructive pulmonary disease, mean AUROC exceeds 0.96 on both validation and test sets, with low variance indicating stable discrimination. For lower respiratory tract infection, AUROC values above 0.84 reflect robustness despite clinical heterogeneity. For pulmonary shadows, AUROC exceeds 0.87, although lower precision and AUPRC values highlight challenges associated with class imbalance and clinically silent presentations. These findings position the model as a triage or risk stratification tool rather than a standalone diagnostic solution for such conditions.
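The gap between a high AUROC and a lower AUPRC under class imbalance can be reproduced with a toy example. The data below are invented and unrelated to the study's results: two positives among twenty cases, ranked second and fourth by a hypothetical model. The average-precision function is a simple step-wise AUPRC estimate that assumes no tied scores.

```python
import numpy as np

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive case."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    ranked = labels[order]
    hits = np.cumsum(ranked)
    precision_at_k = hits / np.arange(1, len(ranked) + 1)
    return float(np.sum(precision_at_k * ranked) / ranked.sum())

# Toy data: 20 cases with strictly decreasing scores; positives sit at ranks 2 and 4.
scores = np.arange(20, 0, -1, dtype=float)
labels = np.zeros(20, dtype=int)
labels[[1, 3]] = 1

roc = auroc(labels, scores)             # 33/36, about 0.92
ap = average_precision(labels, scores)  # 0.5
```

Only two negatives outrank a positive here, so AUROC stays near 0.92, yet half of the cases flagged at each positive's rank are false alarms, which pulls average precision down to 0.5. This is the pattern that makes a rare condition better suited to triage than to standalone diagnosis.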
In multi-label classification across seven respiratory diseases, the framework achieves superior AUROC and AUPRC compared with baseline models, demonstrating effective integration of multimodal data and strong generalisation. Cross-device evaluation shows that models without adversarial mechanisms experience performance degradation on unseen devices, whereas adversarially trained models maintain stable performance, confirming the effectiveness of device-invariant learning. Additional experiments show that incorporating invariant risk minimisation improves training stability and reduces performance variance across repeated runs. Model scaling analyses indicate that larger models achieve slightly higher accuracy, while smaller configurations offer advantages in inference speed and resource efficiency, supporting deployment in mobile and resource-constrained environments.
A device-invariant multimodal learning framework enables robust classification of respiratory diseases using cough acoustics, demographic information and symptom descriptions. The integration of adversarial training and invariant risk minimisation improves generalisation across heterogeneous devices and clinical settings, supporting reliable performance in real-world deployment scenarios. Strong results across binary and multi-label tasks highlight its potential for scalable screening, triage and decision support in primary care and remote monitoring contexts. Remaining challenges include class imbalance, limited interpretability and constrained population diversity. Further validation in real-world environments and expansion to additional data modalities will be essential to strengthen clinical applicability and support broader adoption across healthcare systems.
Source: npj Digital Medicine
References:
Yang M, Liu X, Du W et al. (2026) A device-invariant multi-modal learning framework for respiratory disease classification. npj Digit Med: In Press.