The increasing reliance on routinely collected administrative data has significantly shaped healthcare decision-making across the NHS. However, the accuracy of insights derived from these data is closely linked to the quality of the datasets. Inconsistencies, incomplete records or data entry anomalies can distort analyses and undermine the effectiveness of operational and strategic planning. To mitigate these issues, researchers developed a structured three-step data quality assurance (DQA) process. Applied to a single NHS acute provider, this approach demonstrates how targeted techniques can address data imperfections and support more reliable, evidence-based decisions at ward level.
Diagnosing and Correcting Data Flaws
The DQA process was designed as a sequential methodology comprising univariate, bivariate and imputation stages. The first step involved univariate analysis, where completeness and consistency were assessed using summary statistics such as variable entropy, modal value and cardinality. These metrics helped flag problematic variables: those either dominated by a single value or missing a substantial share of their data. For example, acuity scores for higher levels of care, such as Level 2 and Level 3, frequently showed missing entries. The causes varied: in some cases zeros had simply been omitted, while in others no data had been entered at all.
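The univariate summaries named above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's own code: the function name `profile_column`, the use of `None` to mark a missing entry and the sample values are all assumptions made for the example.

```python
import math
from collections import Counter

def profile_column(values):
    """Summarise one column: missing share, cardinality, mode and Shannon entropy."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    n = len(present)
    # Shannon entropy in bits; values near zero mean a single value dominates
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    modal_value = counts.most_common(1)[0][0] if counts else None
    return {
        "missing_fraction": 1 - n / len(values) if values else 0.0,
        "cardinality": len(counts),
        "modal_value": modal_value,
        "entropy_bits": entropy,
    }

# Hypothetical Level 3 acuity counts for one ward; None marks a missing entry
level3 = [0, 0, None, 1, None, 0, 2, None]
summary = profile_column(level3)
print(summary)
```

A column with a high missing fraction, or with very low entropy and a single dominant modal value, would be a candidate for closer inspection in the next step.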
In the second step, bivariate analysis explored relationships between missingness and other variables to identify systematic patterns. A key finding was the strong link between missing acuity scores and the status variable indicating data entry timing. Where the status indicated no data entry, entire rows were missing; in other cases, partial gaps suggested that blank fields had been used as shorthand for zero. This analysis informed the two-stage imputation strategy used in the final step. First, zeros were imputed in rows with partial missingness. Then, for rows that were missing entirely, multiple imputation was applied using the Amelia II algorithm.
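The bivariate check and the first imputation stage might look something like the sketch below. The column names (`status`, `level1` to `level3`), the status labels and the sample rows are hypothetical. The second stage is not reproduced here, because Amelia II is an R package and the study's settings are not given in this summary.

```python
from collections import defaultdict

# Hypothetical rows: a data-entry status plus counts for three care levels
rows = [
    {"status": "entered", "level1": 12, "level2": 3, "level3": None},
    {"status": "entered", "level1": 10, "level2": None, "level3": None},
    {"status": "not_entered", "level1": None, "level2": None, "level3": None},
    {"status": "entered", "level1": 9, "level2": 2, "level3": 1},
]
ACUITY = ["level1", "level2", "level3"]

def missing_rate_by(rows, group_key, target):
    """Share of rows with a missing target, split by the grouping variable."""
    totals, missing = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r[group_key]] += 1
        missing[r[group_key]] += r[target] is None
    return {g: missing[g] / totals[g] for g in totals}

def impute_partial_zeros(rows, cols):
    """Fill gaps with 0 only where at least one acuity value was recorded."""
    out = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if any(r[c] is not None for c in cols):
            for c in cols:
                if r[c] is None:
                    r[c] = 0
        out.append(r)
    return out

rates = missing_rate_by(rows, "status", "level3")
stage_one = impute_partial_zeros(rows, ACUITY)
print(rates)
print(stage_one)
```

Rows in which every acuity field is empty are deliberately left alone, so that they can be passed to a multiple imputation routine rather than being filled with zeros.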
By tailoring the imputation to the data’s time-series-cross-sectional structure and including variables such as ward and observation time, the technique ensured more realistic data restoration. Importantly, the transformation of discrete count variables into a pseudo-normal distribution enabled effective application of imputation algorithms that assume normality. The process restored missing values while preserving the dataset’s internal logic, allowing for more nuanced analysis than deletion-based methods would permit.
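The summary does not say which transformation was used to make the count variables approximately normal. As an illustration only, the sketch below assumes a square-root transform, a common variance-stabilising choice for counts, together with a back-transform that rounds and clips imputed values to valid non-negative counts. Both function names are hypothetical.

```python
import math

def to_pseudo_normal(count):
    """Square-root transform (an assumed choice, not confirmed by the source)."""
    return math.sqrt(count)

def to_count(value):
    """Invert the transform and return a non-negative integer count."""
    # An imputed value may fall below zero on the transformed scale; clip it first
    return max(0, round(max(value, 0.0) ** 2))

# Round trip for an observed count, then back-transform of two imputed values
print(to_count(to_pseudo_normal(7)))
print(to_count(1.6), to_count(-0.4))
```

Imputation would be run on the transformed scale, with `to_count` applied afterwards so that restored values remain plausible counts.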
Aligning Data for Effective Analysis
After correction, datasets were aggregated at the ward-month level. This format allowed patient acuity scores to be aligned with incident reporting data. Incident outcomes were dichotomised into harmful and non-harmful categories. To control for natural variation across wards, monthly acuity levels were detrended by subtracting the previous year’s ward-specific averages. This alignment ensured that any patterns detected were not confounded by structural differences between wards.
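The aggregation and detrending described above can be sketched as follows. The tuple layout and sample figures are invented for the example, and the baseline is assumed to be each ward's mean over its ward-month values in the previous calendar year, which is one reading of the description.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (ward, year, month, acuity count)
observations = [
    ("A", 2019, 3, 10), ("A", 2019, 3, 14), ("A", 2019, 4, 12),
    ("A", 2020, 3, 18), ("A", 2020, 3, 20),
    ("B", 2019, 3, 4), ("B", 2020, 3, 5),
]

def ward_month_means(obs):
    """Average acuity for each (ward, year, month) cell."""
    cells = defaultdict(list)
    for ward, year, month, value in obs:
        cells[(ward, year, month)].append(value)
    return {key: mean(vals) for key, vals in cells.items()}

def detrend(monthly):
    """Subtract each ward's mean over the previous calendar year."""
    by_ward_year = defaultdict(list)
    for (ward, year, _), value in monthly.items():
        by_ward_year[(ward, year)].append(value)
    baseline = {k: mean(v) for k, v in by_ward_year.items()}
    # Ward-months with no previous-year baseline are dropped
    return {
        (ward, year, month): value - baseline[(ward, year - 1)]
        for (ward, year, month), value in monthly.items()
        if (ward, year - 1) in baseline
    }

monthly = ward_month_means(observations)
detrended = detrend(monthly)
print(detrended)
```

Because each ward is compared with its own history, a ward that routinely carries high acuity is not flagged simply for being different from its neighbours.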
With the cleaned and aligned data, a binomial logistic regression model was used to evaluate whether the ratio of harmful to non-harmful incidents varied with patient acuity and staffing pressures. The model also examined differences before and during the initial Covid-19 lockdown in March 2020. The analysis revealed that rising patient numbers, rather than changes in specific levels of care, were significantly associated with an increased proportion of harmful incidents. Furthermore, the impact of the lockdown was not uniform; it differed by ward, suggesting local operational pressures influenced outcomes.
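To make the modelling step concrete, here is a minimal sketch of a binomial logistic regression fitted by Newton-Raphson on grouped counts. It is a simplification under stated assumptions: it uses a single hypothetical predictor (detrended patient numbers) and invented counts, whereas the study's model also included lockdown and ward terms. In practice a statistics library would normally be used instead of hand-written fitting code.

```python
import math

def fit_binomial_logit(x, harmful, total, iterations=25):
    """Newton-Raphson fit of logit(p) = b0 + b1 * x for grouped binomial counts."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, harmful, total):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = ni * p * (1.0 - p)      # binomial variance weight
            g0 += yi - ni * p           # score for the intercept
            g1 += xi * (yi - ni * p)    # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system: information matrix * step = score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical ward-months: detrended patient numbers, harmful and total incidents
patients_detrended = [-2, -1, 0, 1, 2]
harmful = [2, 3, 5, 7, 9]
total = [20, 20, 20, 20, 20]

b0, b1 = fit_binomial_logit(patients_detrended, harmful, total)
print("slope:", b1, "odds ratio per extra patient:", math.exp(b1))
```

A positive slope in this toy data corresponds to the kind of finding reported: the odds that an incident is harmful rise as patient numbers increase.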
Crucially, this ward-specific variation was not visible in models that omitted the imputation process. Without correcting the data, the analysis suggested a system-wide effect of the lockdown without differentiating between individual wards. This discrepancy illustrates how flawed data can obscure vital insights and lead to generic, potentially ineffective interventions. In contrast, the imputed model allowed identification of ‘hot spot’ wards requiring targeted support, demonstrating the operational value of robust data handling.
Implications for NHS Decision-Making
The study’s findings have important implications for how NHS organisations approach data governance and operational planning. Incomplete or inconsistent data can lead to misdirected policies and misallocated resources. By implementing a structured DQA process, healthcare providers can enhance the reliability of their data-driven decisions. The three-step framework of inspection, exploration and improvement offers a replicable approach adaptable to different contexts and datasets.
Furthermore, the application of imputation techniques allows for the inclusion of data points that might otherwise be discarded, particularly those from under-represented or underserved patient groups. These groups often experience higher levels of data incompleteness, and their exclusion could exacerbate disparities in care. By ensuring they remain part of the analysis, the NHS can promote more inclusive decision-making and equity in service delivery.
The approach also supports the growing need for operational agility. Unlike clinical trials, which rely on pre-defined protocols and static analysis plans, real-time healthcare operations require flexible methodologies that can adapt to the data at hand. This DQA process aligns well with that requirement, offering a transparent and objective means to evaluate and improve data readiness for immediate use.
A structured approach to data quality assurance is essential for deriving meaningful insights from NHS administrative data. The application of univariate and bivariate analyses, followed by tailored imputation, enables correction of incomplete or biased datasets, ultimately supporting more precise analysis. By ensuring that variations across wards and time periods are accurately represented, the DQA process strengthens the foundation for operational decisions. This methodology improves the reliability of internal assessments and enhances the fairness and responsiveness of care strategies. In the future, investment in data quality practices will be key to realising the full analytical potential of NHS administrative data.
Source: Health Informatics Journal