The emergence of digital medicine has been tightly interwoven with the proliferation of deep learning models, which require extensive and diverse datasets for effective development and validation. However, stringent privacy regulations, particularly in healthcare, limit the accessibility and sharing of real patient data. This tension between data needs and privacy preservation has intensified interest in synthetic health records (SHRs).
SHRs are artificially generated datasets designed to mirror real-world electronic health records (EHRs) while ensuring that no identifiable patient information is retained. A recent scoping review offers a comprehensive analysis of deep learning models capable of generating synthetic medical text, time series and longitudinal data. It explores their methodological strengths, data modalities, objectives and performance metrics, shedding light on both the promises and challenges of SHR implementation in healthcare innovation.
Synthetic Time Series for Physiological Modelling
Synthetic time series data generation is pivotal for modelling physiological signals such as electrocardiograms (ECG) and electroencephalograms (EEG), which are widely used in diagnostic and monitoring contexts. In this domain, 22 studies focused on time series generation, with the majority targeting data scarcity and privacy concerns. GAN-based models were dominant due to their capacity to replicate temporal dynamics, though issues such as mode collapse and sensitivity to hyperparameters were noted.
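To make that setup concrete, the sketch below shows a minimal recurrent GAN for a single-channel physiological signal. It is written in PyTorch purely for illustration; the architecture, layer sizes and training loop are simplified assumptions of ours, not a model taken from any of the reviewed studies.

```python
# Minimal GAN sketch for physiological time series (illustrative only;
# the reviewed studies use far more elaborate architectures).
import torch
import torch.nn as nn

SEQ_LEN, NOISE_DIM, HIDDEN = 256, 32, 64  # hypothetical sizes

class Generator(nn.Module):
    """Maps a noise sequence to a synthetic single-channel signal (ECG-like)."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(NOISE_DIM, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, 1)

    def forward(self, z):               # z: (batch, SEQ_LEN, NOISE_DIM)
        h, _ = self.rnn(z)
        return torch.tanh(self.out(h))  # (batch, SEQ_LEN, 1)

class Discriminator(nn.Module):
    """Scores whether a sequence looks real or synthetic."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(1, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, 1)

    def forward(self, x):               # x: (batch, SEQ_LEN, 1)
        _, h = self.rnn(x)
        return self.out(h.squeeze(0))   # raw logits

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):                   # real: (batch, SEQ_LEN, 1) signals
    batch = real.size(0)
    z = torch.randn(batch, SEQ_LEN, NOISE_DIM)
    fake = G(z)

    # Discriminator update: real -> 1, fake -> 0.
    loss_d = (bce(D(real), torch.ones(batch, 1))
              + bce(D(fake.detach()), torch.zeros(batch, 1)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: try to fool the discriminator.
    loss_g = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

The adversarial loop above is also where the cited weaknesses appear in practice: if the generator locks onto a narrow set of waveforms, mode collapse, or if training diverges under small hyperparameter changes.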
Diffusion models emerged as an alternative, demonstrating superior performance in capturing long-term dependencies in physiological patterns. Applications of synthetic time series extend beyond simple data replication; they support minority class generation, improve model robustness in low-resource settings and aid in imputation where signal gaps exist. Nevertheless, evaluating fidelity remains challenging, especially when translating synthetic patterns into clinically meaningful outputs. Fidelity is often assessed visually or by benchmarking with classifiers, but the lack of standardised metrics for re-identification risk complicates the validation process.
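Classifier benchmarking is frequently framed as a "train on synthetic, test on real" (TSTR) check: a model fitted only on synthetic data should still score well on held-out real data if the synthetic signals carry useful structure. A minimal sketch, assuming placeholder feature arrays and scikit-learn, might look like this:

```python
# Illustrative "train on synthetic, test on real" (TSTR) check with scikit-learn.
# X_syn/y_syn and X_real/y_real are placeholder arrays; in practice they would be
# windows or summary features extracted from ECG/EEG recordings.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_syn, y_syn = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)
X_real, y_real = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_syn, y_syn)                        # train only on synthetic data
auc = roc_auc_score(y_real, clf.predict_proba(X_real)[:, 1])
print(f"TSTR AUROC on real data: {auc:.3f}")
```

An AUROC close to the baseline of a classifier trained on real data would indicate high utility; a large gap points to structure missing from the synthetic signals. Note that this measures usefulness, not privacy, which is exactly the gap in standardised metrics the review highlights.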
Preserving Privacy with Synthetic Longitudinal Data
Longitudinal data encapsulate patient trajectories across multiple visits and timepoints, offering a rich source for understanding chronic disease progression, treatment patterns and long-term health outcomes. In the reviewed studies, 17 papers explored synthetic longitudinal data generation, with privacy protection cited as the primary motivation. GANs again featured prominently, often combined with graph-based and probabilistic approaches to capture the complex interdependencies within patient histories. Some models applied autoregressive or mixed-type architectures to handle the variety of data types—demographics, diagnoses, vitals—within EHRs. Yet, many public datasets used for training are ICU-centric, thereby underrepresenting non-acute cases and demographic diversity.
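As a loose illustration of the autoregressive, mixed-type idea, each synthetic visit can be sampled conditioned on static demographics and on the previous visit. Every variable, threshold and probability in the toy sampler below is invented for demonstration and is not drawn from the review.

```python
# Toy autoregressive sampler for mixed-type longitudinal records (illustrative;
# all variables and probabilities are invented for demonstration purposes).
import numpy as np

rng = np.random.default_rng(42)
DIAGNOSES = ["hypertension", "diabetes", "none"]

def sample_patient(n_visits=4):
    # Static demographics sampled once per patient.
    patient = {"age": int(rng.integers(30, 85)),
               "sex": str(rng.choice(["F", "M"])),
               "visits": []}
    prev_dx = "none"
    for _ in range(n_visits):
        # Diagnosis depends on age and on the previous visit (simple autoregression).
        p_chronic = 0.2 + 0.004 * patient["age"] + (0.3 if prev_dx != "none" else 0.0)
        dx = str(rng.choice(DIAGNOSES[:2])) if rng.random() < min(p_chronic, 0.95) else "none"
        # Vitals drawn conditionally on the sampled diagnosis.
        sbp = rng.normal(150 if dx == "hypertension" else 125, 10)
        patient["visits"].append({"diagnosis": dx, "systolic_bp": round(float(sbp), 1)})
        prev_dx = dx
    return patient

print(sample_patient())
```

Deep generative models replace these hand-written conditionals with learned dependencies, but the ordering of the sampling steps, demographics first, then diagnoses, then measurements, mirrors how the mixed-type architectures described above factor a patient history.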
Limitations in dataset generalisability and the linguistic homogeneity of English-language records present challenges to scalability. Moreover, while fidelity and utility of generated data are typically assessed, few studies adequately quantify the risk of re-identification. The absence of universally accepted performance metrics for privacy evaluation hinders robust model comparison and slows adoption in regulatory-sensitive environments.
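One commonly discussed heuristic, sketched below under our own assumptions rather than as a metric endorsed by the review, is a distance-to-closest-record comparison: if synthetic records lie much closer to the generator's training records than genuinely unseen records do, memorisation, and therefore elevated re-identification risk, is likely.

```python
# Illustrative distance-to-closest-record (DCR) check; arrays are placeholders
# standing in for numerically encoded patient records.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
train = rng.normal(size=(1000, 15))      # records the generator was trained on
holdout = rng.normal(size=(300, 15))     # real records never seen in training
synthetic = rng.normal(size=(300, 15))   # generated records

nn_index = NearestNeighbors(n_neighbors=1).fit(train)

def median_dcr(records):
    """Median distance from each record to its closest training record."""
    dist, _ = nn_index.kneighbors(records)
    return float(np.median(dist))

# Synthetic distances far below the hold-out baseline suggest memorisation,
# i.e. an elevated re-identification risk.
print("holdout DCR:  ", round(median_dcr(holdout), 3))
print("synthetic DCR:", round(median_dcr(synthetic), 3))
```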
Generating Clinical Text: Opportunities and Constraints
Clinical narratives provide nuanced insight into patient conditions, physician assessments and care decisions. Generating synthetic medical text requires handling linguistic variability, contextual coherence and embedded domain knowledge. Of the 13 reviewed studies in this area, most employed large language models (LLMs), particularly GPT-style architectures, which have shown considerable success in producing coherent synthetic notes across multiple languages. These models demonstrated strong potential in both privacy preservation and addressing data scarcity. Yet LLMs carry substantial computational costs and show limitations in complex reasoning, both of which affect the accuracy and reliability of synthetic clinical narratives.
Chain-of-thought prompting has been proposed to enhance reasoning in generated text, but its effectiveness in multi-modal healthcare contexts remains inconclusive. Furthermore, the reproducibility of results is often hampered by inaccessible codebases and undocumented hyperparameter choices. While synthetic clinical text can support de-identification, enhance named entity recognition and enrich underrepresented clinical scenarios, achieving consistent quality and reliability remains a work in progress.
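By way of illustration only, a chain-of-thought style prompt for synthetic note generation asks the model to reason about the case before writing the note. The snippet below uses the Hugging Face transformers text-generation pipeline with a placeholder model name; it is a sketch of the prompting pattern, not a recipe from the reviewed studies.

```python
# Sketch of a chain-of-thought style prompt for synthetic clinical notes.
# "gpt2" is only a placeholder; any instruction-tuned text-generation model
# available through Hugging Face transformers could be substituted.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "You are generating a fully synthetic discharge summary. No real patient exists.\n"
    "Case parameters: 67-year-old with community-acquired pneumonia, 4-day admission.\n"
    "First, reason step by step about the expected timeline of symptoms, findings and\n"
    "treatments. Then write the discharge summary, keeping it consistent with that\n"
    "reasoning and free of any identifying details.\n"
)

note = generator(prompt, max_new_tokens=300, do_sample=True, temperature=0.8)
print(note[0]["generated_text"])
```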
The generation of synthetic health records represents a critical enabler for data-driven healthcare, offering viable solutions to the persistent challenges of data scarcity, class imbalance and patient privacy. This scoping review reveals that while generative adversarial networks dominate time series modelling, longitudinal data benefit from probabilistic and graph-based methods, and clinical text is best served by large language models. However, across all modalities there is a clear need for robust, standardised performance metrics that address fidelity, utility and privacy.
Current gaps in evaluation, data generalisability and model reproducibility limit the immediate application of SHRs in clinical practice. Bridging these gaps requires methodological refinement, regulatory alignment and interdisciplinary collaboration between technologists, clinicians and policymakers. As synthetic data generation matures, it holds the promise of transforming digital medicine by making high-quality, privacy-respecting data widely accessible for innovation and research.
Source: npj Digital Medicine
Image Credit: Freepik