Artificial intelligence is increasingly embedded across healthcare research, clinical decision support and operational planning, intensifying demand for large, diverse and well-curated datasets. Access to human-generated health data remains constrained by privacy regulation, governance requirements and persistent quality issues such as incompleteness and inconsistency. Synthetic data have therefore emerged as a widely promoted solution, offering artificially generated datasets intended to replicate the statistical properties of real-world clinical information. Their use is expanding rapidly across medical AI development, from structured health records to complex longitudinal datasets. However, growing confidence in synthetic data has not been matched by equivalent assurance of their clinical validity. Concerns are rising that overreliance on synthetic data may propagate bias, obscure clinically meaningful variation and foster misplaced trust in AI outputs. These challenges carry direct implications for healthcare leaders responsible for governance, safety and equity in data-driven systems.
Limits of Scaling Synthetic Data
The appeal of synthetic data is closely tied to the belief that increasing data volume improves model accuracy and robustness. In healthcare, this assumption overlooks the structural origins of bias embedded in clinical data, including historical decision making, societal inequities and systematic exclusions during data collection. Scaling datasets, whether real or synthetic, does not resolve these issues and can instead amplify them. Synthetic data generated from biased sources reproduce the same distortions while giving an impression of statistical strength.
Even rigorously collected datasets, such as those derived from clinical trials, face limitations due to selective recruitment and under-representation of specific populations. Synthetic data generated from such sources can create unrealistic expectations about their ability to generalise across diverse patient groups. In addition, synthetic data generation often fails to preserve rare but clinically significant cases, including uncommon disease presentations or atypical treatment responses. Maintaining accurate relationships across intersecting demographic and clinical variables remains particularly challenging, leading to artificial associations that distort genuine clinical patterns.
Healthcare data complexity further complicates reliance on synthetic scale. Patient records integrate laboratory values, medications, clinical notes, billing data and information from non-clinical sources, and each element contributes to clinical context and decision making. Synthetic generators struggle to capture these interdependencies, and the resulting limitations persist regardless of dataset size, underscoring the need for careful validation of synthetic data against specific clinical use cases rather than assuming that scale alone delivers reliability.
Bias Amplification and Clinical Risk
Synthetic data tend to replicate and intensify biases present in their source datasets. When original data omit certain populations, clinical scenarios or interactions, synthetic counterparts are likely to mirror these omissions. Artificial expansion of small samples into large synthetic datasets can create a false sense of diversity and reliability, particularly when heterogeneous populations are represented by limited real-world examples. Such practices risk misleading clinicians and policymakers by overstating the applicability of AI-driven insights.
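The false-diversity problem described above can be illustrated with a minimal sketch: resampling a handful of real records into a large synthetic dataset inflates the row count without adding any genuinely new patient profiles. The records and resampling approach here are purely illustrative assumptions, not a description of any specific generator.

```python
import random

random.seed(0)  # deterministic for illustration

# Assume only three real records exist for an under-represented group
# (sex, age, condition) -- hypothetical values for illustration only.
minority_records = [("F", 82, "cond_A"), ("F", 79, "cond_B"), ("M", 85, "cond_A")]

# Naive expansion: draw 1,000 "synthetic" records from the three originals.
synthetic = [random.choice(minority_records) for _ in range(1000)]

# The dataset looks 1,000 rows deep, but contains only 3 distinct profiles.
print(len(synthetic), len(set(synthetic)))
```

A model trained on such data sees the same three profiles a thousand times over; apparent sample size grows while the information content does not, which is exactly the overstated reliability the paragraph above warns against.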
At the patient level, synthetic datasets frequently struggle to represent nuanced clinical detail. Rare conditions, intersectional identities and complex comorbidities are often smoothed over or distorted, limiting their suitability for high-stakes modelling where granular accuracy is essential. Although synthetic data may comply with privacy regulations, their inability to preserve clinically significant detail renders them unreliable for certain decision-making contexts.
Privacy risks also persist. Synthetic data derived from small or poorly curated datasets can reproduce rare combinations of variables, inadvertently exposing sensitive patient characteristics. At the same time, these distortions can undermine scientific validity by misrepresenting disease heterogeneity and clinically relevant relationships. Without transparent labelling, restricted use and rigorous validation, synthetic data risk becoming efficient for privacy protection but unsafe for guiding care delivery.
Synthetic Trust and the Shift from Quantity to Quality
The growing normalisation of synthetic data has contributed to the emergence of synthetic trust, defined as unwarranted confidence in AI models trained on artificially generated datasets. This trust is reinforced by cognitive biases that favour advanced algorithms and automated outputs, even when underlying data lack traceability or transparency. Synthetic datasets can create an illusion of completeness, masking gaps that disproportionately affect under-represented populations and leading to systematic misestimation of risk.
Addressing these concerns requires a decisive shift from data quantity to verifiable data quality throughout the AI lifecycle. High-quality healthcare data must demonstrate completeness, representativeness and preservation of clinically meaningful relationships. Synthetic data demand additional scrutiny to assess fidelity, detect artificial associations and evaluate loss of rare but critical cases. Continuous incorporation of real-world data is essential to prevent degradation of model performance over time.
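One concrete fidelity check implied above is verifying that rare but clinically significant categories survive generation. The sketch below, a simplified illustration with hypothetical data rather than a standard validation tool, flags categories that appear in the real data below a rarity threshold but are entirely absent from the synthetic counterpart.

```python
from collections import Counter

def rare_case_coverage(real, synthetic, rare_threshold=0.01):
    """Return (category, real_share) pairs for categories that are rare
    in the real data (share below rare_threshold) yet missing entirely
    from the synthetic data -- evidence the generator smoothed them away."""
    real_counts = Counter(real)
    n_real = len(real)
    synth_categories = set(synthetic)
    missing = []
    for category, count in real_counts.items():
        share = count / n_real
        if share < rare_threshold and category not in synth_categories:
            missing.append((category, share))
    return missing

# Toy example: a diagnosis with 0.5% prevalence vanishes after generation.
real = ["common"] * 995 + ["rare_dx"] * 5
synthetic = ["common"] * 1000
print(rare_case_coverage(real, synthetic))
```

In practice such checks would run across many variables and their intersections, but even this minimal form makes loss of rare cases measurable rather than assumed.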
Robust data hygiene practices play a central role, including removal of implausible values, careful handling of missing data and harmonisation of variables across datasets. Clear differentiation between synthetic and real data, combined with transparent documentation of generation methods and limitations, is necessary to support accountability. Governance frameworks that define institutional responsibilities and enforce ongoing validation are increasingly critical as synthetic data become embedded in clinical AI systems.
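The hygiene steps above (removing implausible values and harmonising variables) can be sketched as a simple record-cleaning routine. The field names, plausibility range and unit conversion here are illustrative assumptions, not a prescribed clinical standard; the mg/dL to mmol/L factor of 18 is the conventional conversion for blood glucose.

```python
def clean_record(record):
    """Drop implausible vitals and harmonise glucose units to mmol/L.
    Field names and the heart-rate plausibility range are illustrative."""
    cleaned = dict(record)

    # Treat physiologically implausible heart rates as missing data.
    hr = cleaned.get("heart_rate")
    if hr is not None and not (20 <= hr <= 250):
        cleaned["heart_rate"] = None

    # Harmonise glucose recorded in mg/dL to mmol/L (divide by ~18).
    if cleaned.get("glucose_unit") == "mg/dL":
        cleaned["glucose"] = round(cleaned["glucose"] / 18.0, 2)
        cleaned["glucose_unit"] = "mmol/L"
    return cleaned

records = [
    {"heart_rate": 600, "glucose": 90, "glucose_unit": "mg/dL"},
    {"heart_rate": 72, "glucose": 5.0, "glucose_unit": "mmol/L"},
]
cleaned = [clean_record(r) for r in records]
print(cleaned)
```

Flagging implausible values as missing, rather than silently deleting rows, preserves the audit trail that the documentation and accountability requirements above call for.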
Synthetic data offer practical benefits for addressing privacy constraints and data scarcity in healthcare AI, but their uncritical adoption introduces substantial risk. Bias amplification, loss of clinically significant detail and erosion of trust threaten the reliability and equity of algorithmic decision making. Healthcare professionals and decision-makers must prioritise rigorous validation, transparency and quality safeguards to ensure synthetic data support safe and effective care. The long-term value of synthetic data lies not in their scale or convenience but in their ability to preserve the complexity, diversity and ethical standards essential to responsible medical innovation.
Source: Lancet Digital Health