Growing data availability has accelerated research and development, yet healthcare remains constrained by privacy and confidentiality requirements that limit the use of patient information. Synthetic data generation (SDG) offers a way to share realistic yet non-identifiable data, which is particularly relevant for longitudinal patient data (LPD) that capture trajectories over time. A systematic review conducted under PRISMA guidelines across five databases up to May 2024 identified 39 methods for generating synthetic LPD and examined how they handle temporal structure, mixed data types, missingness and unbalanced follow-up. Most work emphasised resemblance to real data, many assessed utility, and just over half evaluated privacy, though few addressed all three together. Four methods met all key LPD generation challenges, but none included privacy-preserving mechanisms. Small-sample performance remains unclear and non-standardised evaluation hinders comparison. Priorities include privacy-preserving techniques, robust and consistent evaluation, accessible code and clearer regulatory direction.
Scope and Methodological Landscape
The review screened 11 307 records, removed 2 027 duplicates and assessed 8 605 titles and abstracts, leading to 517 full-text screenings; 36 publications from 2016–2024 met the eligibility criteria, including three added from reference lists. The most common objectives centred on privacy-preserving data publishing, and the predominant fields were computer science and medicine. Risk assessments noted potential performance bias where evaluation detail or code was insufficiently described, reporting bias where planned results were not presented, and variability in precision and notation for items such as sample sizes, metrics and privacy budgets. Documentation of training processes was often incomplete or absent, and conflicts of interest were inconsistently reported.
Methods were grouped by operating principle into generative adversarial networks, language models, variational autoencoders, probabilistic graphical models and other approaches. Of 66 total methods identified, 39 were primary methods used in the included studies and 27 served as references. Most primary methods targeted standard LPD combining static and time-varying variables, with eight focused on sequence data and six on trajectories. All primary methods allowed dataset-specific training or customisation except one rule-based generator, and about a quarter required expert knowledge. Temporal structure was addressed through recurrent or convolutional layers in adversarial models, constrained dependencies in graphical models, autoregressive latent designs in variational autoencoders and transformer-based self-attention in language models. Other approaches relied on Markov assumptions or ordered conditional generation.
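To make the "Markov assumptions or ordered conditional generation" idea concrete, here is a minimal sketch of a first-order Markov sequence generator for categorical trajectories. This is a generic illustration, not any specific method from the review; the disease-stage states and data are invented:

```python
import random

def fit_transition_matrix(sequences, states):
    """Estimate first-order Markov transition probabilities from state sequences."""
    counts = {s: {t: 0 for t in states} for s in states}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for s in states:
        total = sum(counts[s].values())
        # Fall back to a uniform row when a state was never a source
        probs[s] = {t: (counts[s][t] / total if total else 1 / len(states))
                    for t in states}
    return probs

def generate_sequence(probs, start, length, rng):
    """Sample a synthetic trajectory by walking the fitted transition matrix."""
    seq = [start]
    for _ in range(length - 1):
        row = probs[seq[-1]]
        seq.append(rng.choices(list(row), weights=list(row.values()))[0])
    return seq

# Toy "real" trajectories with hypothetical disease stages
real = [["mild", "mild", "severe"], ["mild", "severe", "severe"]]
probs = fit_transition_matrix(real, ["mild", "severe"])
synthetic = generate_sequence(probs, "mild", 5, random.Random(0))
```

Each synthetic visit is conditioned only on the previous one, which is exactly the simplification (and the limitation) of the Markov assumption relative to the recurrent and attention-based designs mentioned above.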
Capabilities and Limitations of Generators
Key LPD challenges include unbalanced follow-up, missing values and mixed variable types. Information on unbalanced data handling was available for fewer than half of the primary methods; among those, the majority could generate unbalanced sequences, typically via sequence-to-sequence techniques. Only a small subset could generate missing observations, and for several it was unclear whether missingness reflected true patterns rather than unbalanced structure. One method was confirmed to jointly generate unbalanced data while modelling missingness using a masking strategy. Many approaches required imputation before generation. All primary methods could represent categorical variables, often via one-hot encoding, and most handled numerical variables, sometimes after discretisation.
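The general masking idea can be sketched briefly: pair each visit with an explicit observed/missing indicator, estimate per-visit observation probabilities from the real data, and sample both variable-length follow-up and missing entries. This is a simplified illustration of the concept, not the reviewed method's actual model; all values are toy data:

```python
import random

def fit_missingness(records):
    """Per-visit probability that a measurement is observed (None marks missing)."""
    n_steps = max(len(r) for r in records)
    observed = [0] * n_steps
    totals = [0] * n_steps
    for r in records:
        for t, v in enumerate(r):
            totals[t] += 1
            observed[t] += v is not None
    return [observed[t] / totals[t] for t in range(n_steps)], n_steps

def generate_record(values, obs_prob, max_len, rng):
    """Emit a variable-length sequence with explicit missing entries (None)."""
    length = rng.randint(1, max_len)            # unbalanced follow-up
    return [rng.choice(values) if rng.random() < obs_prob[t] else None
            for t in range(length)]

# Toy "real" data: three patients, uneven follow-up, None marks missingness
real = [[5.1, None, 6.0], [4.8, 5.5], [None, 5.9, 6.2]]
obs_prob, n_steps = fit_missingness(real)
synth = generate_record([5.0, 5.5, 6.0], obs_prob, n_steps, random.Random(1))
```

Keeping the mask explicit is what lets a generator reproduce missingness patterns rather than silently imputing them away.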
Privacy mechanisms were implemented in about one fifth of methods, including differential privacy, optimisation penalties and noise perturbation. However, the four methods that addressed temporal structure, heterogeneous variables, missingness and unbalanced data did not incorporate privacy-preserving mechanisms. Source code was available for the majority of primary methods, with some provided elsewhere or upon request, though a quarter offered neither code nor pseudocode. Python predominated, followed by R, and system requirements were rarely detailed. Deep learning methods captured complex patterns but demanded substantial data and careful preprocessing, and they struggled with missingness and unbalanced observations. Language models showed progress on unbalance and missingness but typically required very large samples and significant compute and often lacked built-in privacy guarantees.
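Of the privacy mechanisms listed, noise perturbation via the Laplace mechanism is the standard differential-privacy building block. A minimal sketch for a count-style query with sensitivity 1 (a generic textbook illustration, not drawn from any reviewed method):

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Add Laplace(0, sensitivity/epsilon) noise via inverse-transform sampling."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                      # Uniform(-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return true_value - scale * sign * math.log(1.0 - 2.0 * abs(u))

# Releasing a hypothetical patient count of 100 with epsilon = 1
rng = random.Random(0)
noisy_count = laplace_mechanism(100, 1.0, 1.0, rng)
```

Smaller epsilon means more noise and stronger privacy; the trade-off against resemblance and utility is exactly why the review flags the absence of such mechanisms in the most capable generators.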
Evaluation of Resemblance, Utility and Privacy
Evaluation practices varied. Most studies generated a single synthetic dataset, a minority produced multiple datasets or explored hyperparameters, and many compared against reference methods. The MIMIC-III dataset was the most frequently used benchmark. Resemblance was commonly assessed, spanning univariate, bivariate, multivariate and temporal domains. Univariate and bivariate checks relied on visual and correlation comparisons, with occasional statistical tests. Multivariate assessments used dimensionality reduction, discriminators and model-based comparisons. Temporal preservation was increasingly examined, yet often through qualitative plots; quantitative checks such as autocorrelations and transition matrices were less frequent.
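A quantitative temporal check such as lag-1 autocorrelation is cheap to compute and can expose synthetic series whose marginal values look right but whose ordering does not. A small sketch with invented toy series:

```python
def lag1_autocorr(series):
    """Lag-1 autocorrelation: how strongly each value tracks its predecessor."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

real = [1, 2, 3, 4, 5, 4, 3, 2]        # smooth trajectory
shuffled = [5, 1, 4, 2, 5, 1, 4, 2]    # overlapping values, temporal order destroyed
# lag1_autocorr(real) is positive; lag1_autocorr(shuffled) is negative
```

A qualitative plot might make both series look plausibly "patient-like", while this one number separates them immediately, which is the argument for routine quantitative temporal checks.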
Utility was evaluated in about three quarters of studies. Some compared statistical inferences between synthetic and original data using models suited to longitudinal analysis, while others focused on predictive performance for classification or forecasting tasks using models such as logistic regression, random forests and recurrent networks. Studies typically focused on one of these utility perspectives rather than both, leaving gaps in understanding how synthetic LPD support inference and prediction simultaneously. Privacy evaluation featured in just over half of studies and covered membership, identity and attribute disclosure, using distance or classifier-based attacks and linkage or inference tests. The choice and interpretation of privacy metrics depended on data formatting and assumptions about within-subject correlation, highlighting the need for careful, context-aware assessment.
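Distance-based membership attacks of the kind referenced above typically flag a candidate record when a synthetic record lies unusually close to it. A minimal nearest-neighbour sketch with a hypothetical threshold and toy records:

```python
def nearest_distance(candidate, synthetic_rows):
    """Euclidean distance from a candidate record to its closest synthetic record."""
    return min(
        sum((a - b) ** 2 for a, b in zip(candidate, row)) ** 0.5
        for row in synthetic_rows
    )

def membership_flag(candidate, synthetic_rows, threshold):
    """Suspect training membership when a synthetic record sits unusually close."""
    return nearest_distance(candidate, synthetic_rows) < threshold

synthetic = [(1.0, 2.0), (3.0, 4.0)]        # toy synthetic release
assert membership_flag((1.0, 2.1), synthetic, 0.5)        # near-duplicate: flagged
assert not membership_flag((10.0, 10.0), synthetic, 0.5)  # distant record: safe
```

For LPD the choice of distance matters: flattening a patient's visits into one vector versus comparing visit-by-visit encodes different assumptions about within-subject correlation, which is the context-dependence the review highlights.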
Synthetic LPD methods are advancing, but capability and evaluation are uneven. Only a few methods address temporal structure, unbalanced follow-up, mixed variable types and missingness together, and none of these integrate privacy-preserving mechanisms. Resemblance checks are common but skew qualitative, utility assessments seldom bridge inference and prediction, and privacy evaluation remains inconsistent. Priorities include embedding robust privacy safeguards, standardising evaluation across resemblance, utility and privacy, improving transparency through publicly accessible code and clarifying regulatory expectations. These steps would strengthen confidence in synthetic LPD for research, development and education while protecting confidentiality.
Source: Journal of Healthcare Informatics Research