Clinical prediction models increasingly inform how likely outcomes are discussed with individual patients during clinical decision-making. A single predicted risk can look precise, yet the uncertainty around that number is rarely visible in routine reporting and can differ substantially between patients. When uncertainty varies across individuals, a model may appear sufficiently certain for some patients while remaining much less reliable for others, raising questions about trustworthiness and fairness. Standard performance metrics do not capture this patient-level uncertainty, so there is a need for practical ways to quantify and communicate it alongside discrimination and calibration.
Sampling Uncertainty and Uneven Representation
Sampling uncertainty arises because prediction models are developed on datasets of finite size. Predicted risks can shift if a model is trained on a different sample drawn from the same population, and uncertainty generally falls as development sample size increases. Instability is especially concerning when model complexity is high relative to the information available. Non-parametric bootstrap resampling has been proposed to evaluate instability during model development, but uncertainty also matters when a prediction is used for a specific patient.
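The bootstrap instability check described above can be sketched in a few lines. This is a minimal illustration using a toy simulated dataset and a hand-rolled Newton-Raphson logistic fit; the sample size, coefficients and instability summary are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_logistic(X, y, n_iter=25):
    """Logistic regression via Newton-Raphson; X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve(X.T @ ((p * (1 - p))[:, None] * X), X.T @ (y - p))
    return beta

# toy development sample (size and coefficients are illustrative)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.5 + x))))

p_orig = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, y)))

# refit on B bootstrap resamples, record each refitted model's predictions
# for the ORIGINAL patients, then summarise the per-patient spread
B = 200
preds = np.empty((B, n))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    preds[b] = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X[idx], y[idx])))

instability = np.abs(preds - p_orig).mean(axis=0)  # mean absolute shift per patient
print(f"median instability {np.median(instability):.3f}, "
      f"worst patient {instability.max():.3f}")
```

Patients whose predictions shift most across refits are exactly those for whom the model is least stable, which motivates a patient-level uncertainty measure.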
Patients are not represented equally in development data, so sampling uncertainty can vary widely between individuals. Effective sample size captures this by expressing how many similar patients, with similarity defined by the model, were effectively represented in the development sample for a given predictor profile. A low effective sample size indicates a predictor combination that is far from average compared with the development dataset, and a prediction that may be less stable for that individual. In discussions of algorithmic fairness, low effective sample sizes can flag groups, defined by variables associated with underrepresentation, whose representation in the development data may be limited.
Estimating Effective Sample Size Across Model Types
Effective sample size is defined by equating two variances. The variance of a patient’s predicted risk is matched to the variance of the mean outcome in a hypothetical independent sample of similar patients whose outcomes were directly observed. This yields a ratio of the outcome variance conditional on the patient’s predictors to the prediction variance. For binary outcomes, the outcome variance can be estimated by substituting the predicted risk into the Bernoulli variance function, giving predicted risk multiplied by one minus the predicted risk.
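Equating the two variances gives a simple closed form: the variance of the mean outcome among n similar patients is p(1-p)/n, so setting this equal to the prediction variance and solving for n yields ESS = p(1-p)/Var(p̂). A minimal sketch, where the predicted risk and prediction standard deviation are illustrative numbers rather than values from the paper:

```python
import numpy as np

def effective_sample_size(p_hat, var_p_hat):
    """ESS = p(1-p) / Var(p_hat): the number of similar patients whose
    observed outcome proportion would have the same variance as the
    model's predicted risk (binary outcome, Bernoulli variance)."""
    return p_hat * (1.0 - p_hat) / var_p_hat

# e.g. a predicted 30-day risk of 0.07 with a prediction SD of 0.005
print(effective_sample_size(0.07, 0.005**2))  # ≈ 2604
```

Read this way, a prediction with SD 0.005 around a 7% risk carries as much information as directly observing the outcome in roughly 2,600 similar patients.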
Prediction variance can be obtained analytically for some models, such as generalised linear models (GLMs), but a computational route is needed for many machine learning approaches. Bootstrap procedures estimate prediction variance by repeatedly generating datasets, refitting the model and measuring variability in predicted risks, with at least 200 bootstrap iterations recommended for stability assessment. Non-parametric bootstrap draws samples with replacement from the original data, yet formal guarantees do not apply for many machine learning models, particularly tree-based methods. Parametric simulation-based bootstrap generates samples from a parametric model fitted to the original data and requires milder conditions for consistency.
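The two bootstrap flavours differ only in how replicate datasets are generated: resampling patients with replacement (non-parametric) versus simulating new outcomes from the fitted model (parametric). A minimal sketch for a single hypothetical patient profile, again using toy simulated data and a hand-rolled logistic fit; the patient vector, sample size and coefficients are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, n_iter=25):
    """Logistic regression via Newton-Raphson; X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve(X.T @ ((p * (1 - p))[:, None] * X), X.T @ (y - p))
    return beta

def predict(X, beta):
    return 1.0 / (1.0 + np.exp(-X @ beta))

# toy development data; sizes and coefficients are illustrative
n, B = 500, 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, predict(X, np.array([-2.0, 1.0])))

beta_hat = fit_logistic(X, y)
x_new = np.array([1.0, 0.5])          # hypothetical patient of interest
p_hat = predict(x_new, beta_hat)

# non-parametric bootstrap: resample patients with replacement, refit, re-predict
np_preds = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    np_preds[b] = predict(x_new, fit_logistic(X[idx], y[idx]))

# parametric bootstrap: simulate new outcomes from the fitted model, refit
p_fit = predict(X, beta_hat)
par_preds = np.empty(B)
for b in range(B):
    par_preds[b] = predict(x_new, fit_logistic(X, rng.binomial(1, p_fit)))

for name, preds in [("non-parametric", np_preds), ("parametric", par_preds)]:
    var_p = preds.var(ddof=1)
    ess = p_hat * (1.0 - p_hat) / var_p
    print(f"{name}: bootstrap SD={np.sqrt(var_p):.4f}, ESS={ess:.0f}")
```

Swapping the logistic fit for any other learner turns this into the computational route described above; only the refitting step changes.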
Simulations and GUSTO Results Highlight Patient-Level Gaps
The approach was illustrated with the Global Utilisation of Streptokinase and Tissue Plasminogen Activator for Occluded Coronary Arteries (GUSTO) dataset of patients with acute myocardial infarction (N=40,830). Death within 30 days occurred in 2,851 patients (7.0%). The dataset was split into a USA development sample (n=23,034) and an external validation sample collected elsewhere (n=17,796).
Five model types were fitted using the same candidate predictors: logistic regression, elastic net, XGBoost, neural network and random forest. Simulations approximated true effective sample sizes by generating outcomes from a fitted model, refitting models to the simulated data and comparing bootstrap-derived effective sample sizes with values based on the simulated risks and prediction variances. Consistency for logistic regression was assessed with 1,000 simulations and 500 non-parametric bootstrap resamples on the full GUSTO US dataset. Across all five models, 100 simulations were run with 200 bootstrap iterations using a subsample of 2,812 patients, including 900 events.
Bootstrap-based and analytical formula-based effective sample sizes were highly similar for logistic regression, and both remained unbiased for the true effective sample size. Effective sample sizes for the elastic net were, on average, close to true values. Effective sample sizes were somewhat overestimated for XGBoost but within the same order of magnitude. Non-parametric bootstrap substantially overestimated effective sample sizes for the random forest and underestimated values for the neural network, whereas parametric bootstrap produced more accurate estimates across model types.
In the clinical illustration, prediction variance was estimated using parametric bootstrap with 200 iterations. External validation showed similar discrimination, with c-statistics ranging from 0.82 for the elastic net, XGBoost, neural network and random forest to 0.83 for logistic regression. Calibration intercepts ranged from 0.06 for random forest to 0.12 for XGBoost. Calibration slopes were between 1.00 and 1.02 for the elastic net, logistic regression and neural network models, while XGBoost had a slope of 0.96 (95% CI 0.91–1.01) and random forest had a slope of 1.04 (95% CI 0.99–1.10).
Effective sample sizes varied widely across models in the GUSTO US dataset. Logistic regression had a median effective sample size of 2,532 with a minimum of 12, while the elastic net had a median of 1,930 with a minimum of 59. The neural network had a median of 193 (IQR 127–326) with a range from 2 to 942. The random forest produced a median of 4 (IQR 3–6) with a range from 1 to 1,643. The XGBoost model had a median of 353 with a range from 3 to 2,595. For logistic regression, values exceeding the total development sample of 23,034 were observed and were limited to patients with extremely low predicted risks.
The contrast between similar model-level metrics and divergent effective sample sizes indicates that individual prediction uncertainty can remain substantial even with large development samples. Effective sample size can be examined during model evaluation to identify patient types with low stability, and can support risk communication where users want uncertainty made explicit.
Effective sample size provides an interpretable way to express sampling uncertainty for individual risk predictions, including in machine learning models. Across five model types applied to 30-day mortality prediction in acute myocardial infarction, discrimination and calibration looked broadly similar while individual predicted risks and effective sample sizes varied widely, including a median effective sample size of 4 for a random forest model. The approach relies on estimating prediction variance and can be computationally intensive, with non-parametric bootstrap lacking formal guarantees for many machine learning methods. Even so, expressing uncertainty as an effective number of similar patients offers a practical route to make prediction uncertainty more transparent at the level where risk estimates are used.
Source: The Lancet Digital Health