Healthcare systems increasingly rely on large, diverse datasets to support evidence-based practice and the development of artificial intelligence. In paediatric care, such datasets are often scarce. Children and adolescents remain underrepresented in research due to ethical, legislative, financial and relational constraints, while privacy considerations further restrict data sharing. These limitations are particularly acute in rare diseases and complex conditions, where patient numbers are low and data are fragmented across institutions. Synthetic data has emerged as a potential response to these challenges. By generating artificial datasets that replicate the statistical properties of real-world data without containing identifiable information, synthetic data offers a way to expand analytical capacity while addressing privacy and access barriers. Its growing role in paediatrics reflects broader shifts in digital health, data governance and the adoption of artificial intelligence across clinical and research settings.

 

What Synthetic Data Is and Why It Matters

Synthetic data refers to information generated by mathematical models or algorithms designed to reproduce the statistical patterns of real-world data. Rather than relying on direct extracts from electronic health records or other clinical sources, synthetic datasets are produced to resemble real data without including actual patient information. These datasets can take multiple forms, including tabular data, text and medical images, and may be fully artificial or combined with selected real variables.

 

In healthcare, synthetic data is valued for its capacity to enhance dataset diversity and robustness, support model development and reduce risks to patient privacy. In paediatrics, these advantages are particularly relevant. Fragmentation of care, limited interoperability between hospital and primary care systems and ethical complexities surrounding consent all constrain the availability of usable datasets. Synthetic data offers a mechanism to augment existing resources and enable analysis in areas where real data are insufficient. Its potential extends to accelerating product testing, facilitating access to data for research purposes and enabling the development of artificial intelligence tools without exposing sensitive information.

 

How Synthetic Data Is Generated and Applied

The generation of synthetic data typically begins with real-world datasets that are harmonised and used to train a generative model. This model may be based on statistical techniques, machine learning or deep neural networks. Through training, the model learns the underlying relationships and distributions within the original data. Once trained, it produces new data points that preserve these statistical characteristics while excluding real patient records.
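The train-then-sample workflow described above can be illustrated with a deliberately simple statistical model. The sketch below is not the method of any particular study: it assumes a bivariate normal distribution as the generative model, and the toy values in `real_rows` (age in months, weight in kilograms) are invented for illustration only.

```python
import math
import random
import statistics

# Invented toy "real" data: (age in months, weight in kg). Not patient data.
real_rows = [(2, 5.1), (6, 7.4), (12, 9.6), (18, 10.9),
             (24, 12.2), (36, 14.3), (48, 16.4), (60, 18.3)]

def fit_gaussian(rows):
    """'Training': estimate means, standard deviations and correlation."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    rho = max(-1.0, min(1.0, cov / (sx * sy)))  # clamp against rounding error
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": rho}

def sample(model, n, rng):
    """'Generation': draw n new pairs that follow the fitted distribution."""
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mix z1 into y so the synthetic pairs keep the learned correlation.
        y = model["my"] + model["sy"] * (
            model["rho"] * z1 + math.sqrt(1 - model["rho"] ** 2) * z2
        )
        rows.append((x, y))
    return rows

model = fit_gaussian(real_rows)
synthetic = sample(model, 5000, random.Random(0))
```

A normal model like this can produce implausible values (for example, negative ages), which is one reason practical generators tend to use richer models and post-generation plausibility checks.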

 

Several approaches to synthetic data generation are used in practice. Fully synthetic data is entirely artificial and offers strong privacy protection but may lose analytical detail. Partially synthetic data replaces selected sensitive variables while retaining some original data, which preserves utility but carries a residual risk of re-identification. Hybrid approaches combine real and synthetic data to balance privacy and analytical value, although they require greater computational resources. These methods are increasingly used to address data scarcity and fragmentation, particularly in contexts where collecting new paediatric data is difficult or impractical.
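As a rough illustration of the partially synthetic approach, the sketch below replaces one sensitive variable with model-based draws while leaving the other columns untouched. The function name `partially_synthesise`, the column names and the records are all hypothetical, and the within-group normal model is an assumption chosen for brevity.

```python
import random
import statistics
from collections import defaultdict

def partially_synthesise(records, sensitive, group_by, rng):
    """Replace one sensitive column with draws from a per-group normal model."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_by]].append(rec[sensitive])

    released = []
    for rec in records:
        values = groups[rec[group_by]]
        mu = statistics.fmean(values)
        # A group with a single record has no spread, so the draw collapses
        # to the real value: an example of residual re-identification risk.
        sigma = statistics.stdev(values) if len(values) > 1 else 0.0
        new = dict(rec)  # copy so the original record is not modified
        new[sensitive] = round(rng.gauss(mu, sigma), 1)
        released.append(new)
    return released

# Invented toy records for illustration only.
records = [
    {"age_band": "neonate", "sex": "F", "weight_kg": 3.2},
    {"age_band": "neonate", "sex": "M", "weight_kg": 3.6},
    {"age_band": "child", "sex": "F", "weight_kg": 18.4},
    {"age_band": "child", "sex": "M", "weight_kg": 21.0},
    {"age_band": "adolescent", "sex": "F", "weight_kg": 52.0},
]
released = partially_synthesise(records, "weight_kg", "age_band", random.Random(1))
```

In this toy run the lone adolescent record keeps its true weight, which shows why retained real structure can leave a route to re-identification when groups are small.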

 


 

Synthetic datasets may be used independently or integrated with real data to train artificial intelligence systems for tasks such as classification or prediction. In rare diseases or neonatal care, where accumulating enough cases to achieve adequate statistical power can take many years, synthetic data can widen the analytical base and support earlier insights. Its use has expanded across clinical domains, including policy modelling, predictive analytics and the testing of digital health technologies.

 

Current Uses, Opportunities and Risks

Applications of synthetic data in paediatrics span multiple specialties. In neonatology, synthetic datasets have been used to replicate epidemiological associations while addressing privacy concerns. In ophthalmology, synthetic images have augmented datasets for retinopathy of prematurity, increasing availability for research in fragile patient populations. Neurodevelopmental conditions have also been explored, with synthetic and real data combined to support automated screening tools. Medical imaging represents a major area of activity, with synthetic magnetic resonance imaging and computed tomography supporting radiation-free imaging strategies and improved dose calculations in radiotherapy.

 

Paediatric nursing is another area where synthetic data holds promise. Early warning systems based on artificial intelligence rely on large and representative datasets to predict clinical deterioration. Synthetic data can enhance existing datasets, particularly in rare conditions, and support more accurate predictive models. By enabling more reliable automation, such systems may reduce administrative burdens and allow healthcare professionals to focus more on direct patient care.

 

Despite these opportunities, significant risks and limitations remain. Synthetic data can reproduce or amplify biases present in original datasets, affecting fairness and reliability. Ensuring data quality and accurately capturing the complexity of paediatric health conditions remain ongoing challenges. Privacy risks persist if anonymisation is inadequate, and the absence of standardised guidelines for generation and validation limits consistency across applications. Addressing these issues requires transparency in generation methods, rigorous validation and attention to the developmental variability unique to children.
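Two of the checks implied above, representation of subgroups and leakage of real records, can be sketched in a few lines. This is a minimal example under stated assumptions, not a standardised validation protocol: the helper names, column names and records are hypothetical, and real evaluations would typically use a broader set of utility and privacy metrics.

```python
from collections import Counter

def subgroup_shares(records, key):
    """Proportion of records falling in each subgroup of `key`."""
    counts = Counter(rec[key] for rec in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def max_share_gap(real, synthetic, key):
    """Largest absolute difference in subgroup share between the datasets."""
    real_shares = subgroup_shares(real, key)
    synth_shares = subgroup_shares(synthetic, key)
    groups = set(real_shares) | set(synth_shares)
    return max(abs(real_shares.get(g, 0.0) - synth_shares.get(g, 0.0))
               for g in groups)

def exact_copies(real, synthetic):
    """Count synthetic records identical to some real record."""
    real_rows = {tuple(sorted(rec.items())) for rec in real}
    return sum(tuple(sorted(rec.items())) in real_rows for rec in synthetic)

# Invented toy records for illustration only.
real = [
    {"age_band": "neonate", "sex": "F", "weight_kg": 3.2},
    {"age_band": "infant", "sex": "M", "weight_kg": 7.9},
    {"age_band": "child", "sex": "F", "weight_kg": 18.4},
    {"age_band": "child", "sex": "M", "weight_kg": 21.0},
]
synthetic = [
    {"age_band": "infant", "sex": "M", "weight_kg": 7.9},  # duplicates a real row
    {"age_band": "child", "sex": "F", "weight_kg": 17.6},
    {"age_band": "child", "sex": "M", "weight_kg": 22.3},
    {"age_band": "child", "sex": "F", "weight_kg": 19.1},
]

gap = max_share_gap(real, synthetic, "age_band")  # neonates vanish, children over-represented
copies = exact_copies(real, synthetic)
```

Here the synthetic set drops neonates entirely and reproduces one real record verbatim, so both checks would flag it for review before any release.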

 

Synthetic data represents a growing resource for paediatric healthcare and research, offering practical responses to persistent challenges in data availability, privacy and ethics. Its application across imaging, nursing, rare diseases and predictive analytics highlights its potential to support artificial intelligence and evidence-based care where real-world data are limited. At the same time, risks related to bias, quality and governance underline the need for structured evaluation and paediatric-specific guidance.

 

The relevance of synthetic data lies not only in its technical promise but also in the frameworks required to ensure its fair, safe and effective use. As digital health strategies evolve, synthetic data is positioned to play an increasingly influential role in shaping paediatric research and clinical practice.

 

Source: International Journal of Medical Informatics



References:

Mezzalira E, Boaro MP, Reggiani G et al. (2026) Synthetic data generation in paediatrics and paediatric nursing: what, how, and why? International Journal of Medical Informatics; 209:106236.


