Synthetic data has become a prominent focus in healthcare policy discussions, particularly as governments seek to balance the demand for innovation with the need to protect sensitive patient information. Defined as data generated using mathematical models or algorithms, synthetic data offers a way to replicate the statistical features of real-world health records without exposing identifiable details. Its promise lies in enabling the development of artificial intelligence models that are accurate, fair and broadly applicable, while avoiding the privacy risks inherent in sharing original patient information.
In the United Kingdom, the potential of synthetic data is especially significant. The National Health Service (NHS) holds a uniquely comprehensive dataset, spanning entire patient lifetimes and encompassing both primary and secondary care. This provides a powerful resource for improving healthcare through data-driven research. Yet history has shown how easily trust can be lost when patients feel excluded or inadequately protected. NHS England’s care.data scheme, introduced in 2013 and abandoned in 2016, offers a cautionary tale of how missteps in confidentiality, consent and transparency can derail ambitious initiatives. Lessons from that failure highlight the principles essential for future synthetic data projects to succeed.
Confidentiality and the Privacy-Fidelity Trade-off
The ability of synthetic data to protect confidentiality depends on how closely it resembles the original dataset. At one end of the spectrum are low-fidelity structural datasets, which maintain the shape and format of the data but lack analytical value. At the other end are high-fidelity replica datasets that preserve conditional distributions and therefore support detailed statistical analysis. While high fidelity maximises usefulness, it also raises the risk of re-identification, as artificially generated records may still allow sensitive information to be inferred.
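To make the trade-off concrete, consider the minimal sketch below. It uses an invented two-column dataset (age and systolic blood pressure are assumptions for illustration, not variables discussed in the source) to contrast a low-fidelity structural dataset, in which each column is sampled independently, with a higher-fidelity one that preserves the conditional distribution of blood pressure given age.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy "real" dataset: systolic blood pressure rises with age (illustrative values).
n = 1000
age = rng.integers(20, 90, size=n)
real = pd.DataFrame({
    "age": age,
    "systolic_bp": 100 + 0.5 * age + rng.normal(0, 10, size=n),
})

# Low-fidelity (structural) synthetic data: correct schema and plausible ranges,
# but columns are sampled independently, so the age-BP relationship is lost.
low_fidelity = pd.DataFrame({
    "age": rng.integers(20, 90, size=n),
    "systolic_bp": rng.uniform(real["systolic_bp"].min(),
                               real["systolic_bp"].max(), size=n),
})

# Higher-fidelity synthetic data: fit a simple model for BP given age, then
# sample new ages and add resampled residuals. Reusing real residuals pushes
# the synthetic records closer to the originals (higher utility, higher risk).
coeffs = np.polyfit(real["age"], real["systolic_bp"], deg=1)
residuals = (real["systolic_bp"] - np.polyval(coeffs, real["age"])).to_numpy()
synth_age = rng.choice(real["age"].to_numpy(), size=n, replace=True)
high_fidelity = pd.DataFrame({
    "age": synth_age,
    "systolic_bp": np.polyval(coeffs, synth_age) + rng.choice(residuals, size=n),
})

# Only the higher-fidelity version preserves the age-BP correlation (roughly 0.7).
for name, df in [("real", real), ("low", low_fidelity), ("high", high_fidelity)]:
    print(name, round(df["age"].corr(df["systolic_bp"]), 2))
```

The higher-fidelity version is far more useful for analysis, but the comment in the final step flags the cost: reusing components of the real records moves the synthetic data closer to the originals, which is precisely the re-identification pressure described above.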
Care.data collapsed in part because confidentiality concerns were not adequately addressed. Despite legal provisions allowing data sharing, patient groups, professional organisations and advocacy bodies were unconvinced that risks of re-identification were low enough. The perception that safeguards were insufficient undermined trust and led to widespread opposition. For synthetic data initiatives, this history underlines the need for rigorous categorisation of datasets into different privacy risk levels, supported by clear national thresholds. Low-fidelity synthetic data may be shared more freely, while medium and high-risk datasets should be restricted to Trusted Research Environments with stringent oversight.
Such stratification not only reassures patients and clinicians but also provides practical guidance for organisations generating synthetic data. If they cannot meet the safeguards required for high-risk categories, they must instead prioritise lower-fidelity data. By matching fidelity with safeguards, initiatives can strike a balance between privacy and utility, building confidence that confidentiality will not be compromised. Without these measures, synthetic data may face the same scepticism that doomed care.data.
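One way to picture such stratification is as an explicit, auditable mapping from a measured fidelity level to a permitted release environment. The sketch below is hypothetical: the tier names, the 0.3 and 0.7 cut-offs and the release rules are placeholder assumptions standing in for the ‘clear national thresholds’ the text calls for, not published NHS policy.

```python
from enum import Enum

# Hypothetical privacy-risk tiers; names and descriptions are illustrative.
class RiskTier(Enum):
    LOW = "low-fidelity / structural"
    MEDIUM = "partial statistical fidelity"
    HIGH = "high-fidelity replica"

# Illustrative release rules matching each tier to a level of safeguard.
RELEASE_RULES = {
    RiskTier.LOW: "may be shared openly",
    RiskTier.MEDIUM: "Trusted Research Environment with standard oversight",
    RiskTier.HIGH: "Trusted Research Environment with stringent oversight and audit",
}

def classify(fidelity_score: float) -> RiskTier:
    """Map a fidelity score in [0, 1] to a risk tier.

    The 0.3 and 0.7 cut-offs are placeholders; in practice they would be
    nationally agreed thresholds set by policy rather than by code.
    """
    if fidelity_score < 0.3:
        return RiskTier.LOW
    if fidelity_score < 0.7:
        return RiskTier.MEDIUM
    return RiskTier.HIGH

tier = classify(0.85)
print(f"{tier.name}: {RELEASE_RULES[tier]}")  # HIGH: Trusted Research Environment ...
```

The value of making the mapping explicit is that an organisation unable to meet the safeguards attached to a tier can see unambiguously that it must generate lower-fidelity data instead.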
Consent as a Foundation for Legitimacy
Another central weakness of care.data was its approach to patient consent. The scheme relied on posters in general practices and unaddressed leaflets delivered to households, many of which were never seen or were discarded as junk mail. Even when the leaflets were read, they provided insufficient detail, omitting the programme’s name as well as critical information about re-identification risks and the opt-out process. The result was confusion, distrust and a widespread sense that patient autonomy had been ignored.
Synthetic data initiatives cannot afford to repeat these mistakes. Legality alone is insufficient if patients feel excluded or misled. The principle of informed consent extends beyond the letter of the law to a social contract between patients and the healthcare system. For synthetic data, this means ensuring that individuals understand what synthetic data is, the benefits it may provide, the risks it carries and their rights regarding its use.
Achieving this requires proactive engagement with patients and public groups, ensuring that information is accessible, inclusive and meaningful. Communication strategies must account for different literacy levels, language barriers and preferences for how information is received. Engagement must be continuous rather than one-off, with opportunities for feedback and dialogue. When patients feel that their consent is genuinely sought and respected, initiatives gain legitimacy. Without it, synthetic data risks the same fate as care.data, undermined by opposition rooted in mistrust.
Transparency and Governance
A third key lesson from care.data is the importance of transparency. Much of the opposition to the programme stemmed from uncertainty over who would have access to patient information and how it would be used. The eventual restrictions imposed by the Care Act 2014 on access for marketing and insurance purposes came too late to restore public confidence. Similar controversies have arisen more recently, with criticism of NHS contracts awarded to large technology companies to manage data platforms, further highlighting public sensitivity to governance arrangements.
For synthetic data to succeed, transparency must be embedded from the outset. Patients must be clearly informed about who will generate, manage and use synthetic datasets. They should also have the option to opt out of having their data used to create synthetic records, and to decide whether commercial entities may be granted access. Such measures respect individual choice and demonstrate that patient interests are prioritised.
Equally important is the governance structure overseeing synthetic data. Assigning responsibility to a designated NHS body would provide assurance that sensitive information is being managed in the public interest. Where external partners are involved, potential conflicts of interest must be openly declared, and decision-making processes should be subject to scrutiny. Examples of publicly funded collaborative platforms that have gained support from professional associations show that trust depends not only on the technology itself but also on who controls it. Synthetic data initiatives must therefore combine technical safeguards with governance arrangements that are transparent, accountable and trusted.
The potential of synthetic data in healthcare is substantial. By enabling wider access to information while protecting patient privacy, it can help develop artificial intelligence models that are both accurate and equitable, improving outcomes across diverse populations. Yet its promise will only be realised if the lessons of the past are applied. Care.data demonstrated how easily trust can be lost when confidentiality, consent and transparency are neglected.
Future initiatives must therefore adopt a risk-based approach to confidentiality, ensure meaningful and inclusive consent processes, and commit to full transparency about data use and governance. Only by embedding these principles can synthetic data gain the trust required to succeed in the United Kingdom’s healthcare system. In doing so, it can become not another abandoned project but a tool that genuinely improves care while respecting the rights of patients.
Source: npj digital medicine