Synthetic data generation (SDG) for tabular records is gaining prominence as a way to broaden access to sensitive healthcare datasets while reducing reliance on identifiable records. By learning patterns from original data and producing artificial records with no one-to-one mapping to real individuals, SDG is positioned as a privacy-enhancing option for research, software development and training. Regulators and commentators nonetheless warn that synthetic data is not inherently free of privacy risk, particularly when personal information is used in model training and when outputs may still enable inference. Guidance from the United Kingdom, Singapore and South Korea reflects a growing consensus that SDG is subject to data protection regulation and that synthetic data can be treated as non-personal only when residual disclosure risk is demonstrably low.
Synthetic Data Use Cases and Associated Risks
Healthcare datasets hold significant value but are often difficult to access for secondary use. SDG is described as a practical way to make patterns within these datasets more widely available. Several national and programme-level initiatives already provide synthetic versions of health data, supporting research where direct access to original datasets is restricted.
The source text also outlines uses beyond research. Synthetic data can support the development and testing of digital health software, including artificial intelligence and machine learning systems, without exposing personal health information. It can also be used for education and training, providing realistic data structures without relying on patient records.
Risk is divided into two broad categories: bias and privacy. Bias is framed as a general challenge for AI and machine learning, in which uneven patterns in clinical research or healthcare access can be replicated and reinforced. Ethical guidance across jurisdictions applies principles such as non-maleficence, transparency and fairness to these concerns.
Privacy risk is more closely tied to SDG itself. Although SDG aims to reduce identity disclosure, vulnerabilities remain possible when models overfit or when outputs support inference. This places emphasis on managing disclosure risk across the SDG lifecycle rather than assuming that synthetic data is inherently safe to share.
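Although the guidance does not prescribe a specific test, one common screening technique for this kind of disclosure risk is a distance-to-closest-record check: synthetic rows that sit unusually close to a real training row can signal memorisation by an overfitted model. The sketch below is a minimal illustration, assuming numeric features on a comparable scale; the function name, arrays and threshold choice are hypothetical and not taken from the source.

```python
import numpy as np

def dcr_screen(real: np.ndarray, synthetic: np.ndarray, quantile: float = 0.05):
    """Distance-to-closest-record screen for near-copied synthetic rows.

    Flags synthetic rows whose nearest real neighbour is closer than a
    baseline quantile of nearest-neighbour distances within the real data.
    """
    # Nearest real neighbour for each synthetic row (brute-force Euclidean).
    diffs = synthetic[:, None, :] - real[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

    # Baseline: how close real records naturally sit to one another.
    real_diffs = real[:, None, :] - real[None, :, :]
    real_dists = np.sqrt((real_diffs ** 2).sum(axis=2))
    np.fill_diagonal(real_dists, np.inf)  # ignore self-distances
    threshold = np.quantile(real_dists.min(axis=1), quantile)

    return nearest, threshold, nearest < threshold

# Hypothetical usage with random data standing in for real/synthetic tables.
rng = np.random.default_rng(0)
_, thr, flags = dcr_screen(rng.normal(size=(200, 5)), rng.normal(size=(150, 5)))
print(f"{flags.sum()} of {flags.size} synthetic rows fall below threshold {thr:.3f}")
```

A check like this catches only near-copies; inference risks such as membership or attribute inference require their own evaluations across the SDG lifecycle.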
How Three Regulators Classify SDG and Synthetic Outputs
Guidance from the United Kingdom, Singapore and South Korea converges on the view that generating synthetic data from personal information constitutes processing under privacy and data protection law. SDG therefore falls within regulatory scope, regardless of whether consent is the lawful basis relied upon.
The source text notes that jurisdictions differ on whether SDG can proceed without new consent, but all three acknowledge that other lawful bases or exceptions may apply. In the United Kingdom, the lawful basis depends on context, and processing may be compatible with the original purpose of collection when synthetic data supports research workflows. Singapore permits SDG under consent or applicable exceptions linked to legitimate interests or research. South Korea permits SDG without separate consent but highlights that additional requirements may apply in healthcare and research contexts.
All three jurisdictions also address when synthetic data can be treated as non-personal. The determining factor is whether residual disclosure risk is very low, a judgement that remains methodologically unsettled. The United Kingdom focuses on whether personal information used in modelling can be inferred from synthetic outputs, consistent with the broader principle of considering what is reasonably likely to lead to identification. Singapore and South Korea express similar expectations and emphasise evaluation-based decisions rather than blanket classifications.
Evaluation, Ethical Expectations and Public Trust
Regulatory expectations extend beyond privacy to include quality assessment and ethical considerations. Quality evaluation seeks to determine whether synthetic data is an adequate proxy for the original data, using methods aligned with the intended use. These approaches assess similarity in statistical properties or performance in downstream tasks.
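As a concrete illustration of the first kind of check, the sketch below compares per-column marginal distributions of a real and a synthetic table using a two-sample Kolmogorov-Smirnov statistic. It is a minimal sketch under assumed inputs: the DataFrames and any acceptance cut-off are hypothetical, and a full assessment would also test joint structure and downstream task performance.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def marginal_similarity(real: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    """Two-sample KS statistic for each shared numeric column.

    Lower statistics indicate closer marginal distributions; this probes
    statistical similarity only, not utility in downstream tasks.
    """
    rows = []
    for col in real.columns.intersection(synth.columns):
        if np.issubdtype(real[col].dtype, np.number):
            stat, pval = ks_2samp(real[col].dropna(), synth[col].dropna())
            rows.append({"column": col, "ks_stat": stat, "p_value": pval})
    return pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
```

For use-case-aligned evaluation, the same idea extends to training a model on synthetic data and testing it on held-out real data, so that adequacy is judged against the intended task rather than distributional similarity alone.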
Ethical risks highlighted include bias, falsified data and the possibility that synthetic datasets reveal sensitive trends. The United Kingdom recommends bias detection and correction, with stronger expectations when synthetic data informs decisions with legal or health implications. Singapore notes that synthetic data can replicate sensitive patterns and expose group-level trends that may contribute to collective harm. South Korea acknowledges bias and falsified data and permits evaluation methods that include expert review.
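One simple way to operationalise such bias checks is to compare a group-level outcome rate in the real and synthetic tables and ask whether generation has widened the gap. The sketch below assumes a tabular dataset with a categorical group column and a binary outcome column; all column names are hypothetical placeholders, not fields from any dataset in the source.

```python
import pandas as pd

def group_rate_gap(df: pd.DataFrame, group_col: str, outcome_col: str) -> float:
    """Absolute difference in mean outcome rate between the two largest groups.

    Requires at least two groups; the outcome column is assumed binary (0/1).
    """
    rates = df.groupby(group_col)[outcome_col].mean()
    largest = df[group_col].value_counts().index[:2]
    return float(abs(rates.loc[largest[0]] - rates.loc[largest[1]]))

# Hypothetical usage: flag amplification if the synthetic gap exceeds the real one.
# real_gap = group_rate_gap(real_df, "sex", "readmitted")
# synth_gap = group_rate_gap(synth_df, "sex", "readmitted")
# amplified = synth_gap > real_gap
```

A comparison of this kind also speaks to the group-level concern raised in the Singapore guidance, since it reveals whether a synthetic release exposes or exaggerates a pattern tied to a specific group.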
Broader concerns relate to commercial use and public trust. The source text notes criticism that SDG could serve as a legal loophole, countered by the observation that regulators treat SDG as regulated processing when personal information is involved. It also highlights persistent public ambivalence about commercial use of health data, citing examples from Canada where transparency failures prompted corrective action even when anonymisation was deemed adequate. This underlines that regulatory compliance alone does not guarantee public trust.
SDG is a regulated activity when personal information is used, and synthetic outputs qualify as non-personal only when residual disclosure risk is very low. Expectations include evaluation of quality, bias and wider ethical considerations, reflecting that synthetic data can carry both useful and sensitive patterns. Effective adoption requires clear purpose alignment, scrutiny of risk, and governance practices that support responsible secondary use and maintain public confidence.
Source: npj digital medicine