Radiomics often treats quantitative image features as biomarkers that ought to be stable across scans and settings. Stability is commonly viewed as a prerequisite for clinical utility because unstable features are assumed to introduce unpredictable variation into models. Yet this narrow focus on individual feature stability risks overlooking how information may be distributed across many correlated descriptors and how feature interactions can drive predictive performance. Evidence drawn from simulated test–retest scenarios indicates that features deemed nonreproducible can still support strong classification and that excluding them may degrade accuracy. The findings challenge long-held assumptions about what constitutes a clinically meaningful feature set and argue for prioritising prediction and relevance over a univariable view of stability.  

 

Non-reproducible Features Can Still Predict 

A central argument is that reproducibility and predictiveness are independent properties. A feature may be highly stable yet uninformative, while another may vary under repeated measurement yet contribute essential signal once considered in combination with others. This idea is illustrated by a simple analogy: if an elephant’s presence in a house is detected by checking several rooms, the specific room may change across measurements, but the composite decision can remain accurate. Translating this to imaging, modest geometric shifts between acquisitions can alter slice-based measurements without erasing the underlying signal when the model integrates multiple complementary features.  
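
To make the analogy concrete, the short simulation below is an illustrative sketch with assumed parameters rather than anything taken from the paper: each "room check" is a binary feature that flips between repeat measurements with 30% probability. Any single check agrees with its own retest only about 58% of the time, yet a majority vote over seven checks identifies the elephant correctly far more often.

```python
# Toy illustration (assumed parameters, not from the paper): individually unstable
# binary "room checks" still support an accurate composite decision.
import numpy as np

rng = np.random.default_rng(1)
n_cases, n_checks, flip_prob = 1000, 7, 0.3
truth = rng.integers(0, 2, n_cases)                        # elephant present (1) or not (0)

def noisy_checks(truth):
    """Each check reports the truth but flips independently with probability flip_prob."""
    flips = rng.random((len(truth), n_checks)) < flip_prob
    return truth[:, None] ^ flips

test, retest = noisy_checks(truth), noisy_checks(truth)

single_agreement = (test[:, 0] == retest[:, 0]).mean()     # test-retest agreement of one check
vote = test.mean(axis=1) > 0.5                             # composite (majority-vote) decision
print(f"single-check test-retest agreement: {single_agreement:.2f}")
print(f"majority-vote accuracy:             {(vote == truth).mean():.2f}")
```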

 

To examine this formally, experiments were performed on four datasets from the WORC collection: Lipo, Desmoid, CRLM and GIST. Images were resampled to 1 × 1 × 1 mm³, central slices were identified from expert segmentations, and neighbouring slices 3 mm above and 3 mm below were used to emulate a test–retest scenario with two-dimensional radiomics. A total of 1,015 features per slice were extracted using PyRadiomics with a bin width of 25, with z-score normalisation applied for MRI datasets. Feature reproducibility on the training set was quantified using the concordance correlation coefficient, and thresholds between 0.60 and 0.95 defined “reproducible” and “nonreproducible” subsets. Models combined LASSO-based selection with a random forest classifier and were evaluated on a held-out test set using area under the receiver operating characteristic curve as the primary metric, with concordant patterns observed for area under the precision-recall curve and F1 score.  
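
A minimal sketch of such a pipeline, assuming paired feature matrices from the two neighbouring slices and placeholder hyperparameters (it is not the study's code), might look as follows: the concordance correlation coefficient is computed per feature, features are split at a threshold, and a LASSO-selection plus random-forest model is scored by AUC on the held-out set.

```python
# Sketch of reproducibility filtering followed by LASSO selection and a random forest.
# X_slice_a, X_slice_b: (n_train, n_features) features from the two neighbouring slices;
# X_train/y_train and X_test/y_test: central-slice features and labels. All names,
# the selector configuration and the 0.75 default threshold are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

def ccc(x, y):
    """Lin's concordance correlation coefficient between two measurement vectors."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def evaluate_subset(X_slice_a, X_slice_b, X_train, y_train, X_test, y_test,
                    threshold=0.75, reproducible=True):
    ccc_per_feature = np.array([ccc(X_slice_a[:, j], X_slice_b[:, j])
                                for j in range(X_slice_a.shape[1])])
    mask = ccc_per_feature >= threshold if reproducible else ccc_per_feature < threshold
    model = Pipeline([
        ("select", SelectFromModel(Lasso(alpha=0.01))),              # LASSO-based selection
        ("clf", RandomForestClassifier(n_estimators=500, random_state=0)),
    ])
    model.fit(X_train[:, mask], y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test[:, mask])[:, 1])
```

Comparing the returned AUC for reproducible and nonreproducible subsets across a range of thresholds mirrors the kind of comparison reported in the study.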

 

Results were heterogeneous across tasks. For CRLM, models restricted to reproducible features behaved as expected and generalised best, suggesting that filtering by stability helped. In contrast, for Desmoid, nonreproducible features yielded better performance, particularly around a reproducibility threshold near 0.75. GIST showed a similar tendency up to a threshold of roughly 0.92. In Lipo, nonreproducible features followed the expected pattern and did not surpass the reproducible subset. In every dataset except CRLM, a further observation emerged: models using only the most reproducible features did not outperform models using all features, indicating potential loss of predictive information when nonreproducible descriptors are discarded.  

 

Why Reproducibility Is Hard to Pin Down 

Interpreting feature stability is complicated by the need to choose both a metric and a threshold. No feature is perfectly reproducible under noisy measurements, which forces a subjective cut-off. Whether one selects intraclass correlation or concordance correlation can change which features survive filtering, yet the consequences of these choices are underexplored. Low sample sizes further amplify uncertainty. Simulations of the simple “elephant in the house” setting reveal wide confidence intervals for average reproducibility estimates at small n, raising concerns about inferences drawn from cohorts with fewer than 50 samples. These dependencies on metric selection, thresholding, and sample size make the label “reproducible” fragile and context dependent.  
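
The sensitivity to sample size is easy to demonstrate. The toy simulation below assumes Gaussian feature values with additive measurement noise (not the paper's exact setup) and shows how widely the estimated concordance correlation coefficient scatters at n = 20 compared with n = 200, even though the underlying reproducibility is fixed at 0.8.

```python
# Toy simulation (assumed noise model): spread of CCC estimates at small sample sizes.
import numpy as np

rng = np.random.default_rng(0)

def ccc(x, y):
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def ccc_interval(n, noise_sd=0.5, n_rep=2000):
    """2.5th, 50th and 97.5th percentiles of CCC estimates over repeated cohorts of size n."""
    estimates = []
    for _ in range(n_rep):
        true_value = rng.normal(0, 1, n)                     # underlying feature values
        test = true_value + rng.normal(0, noise_sd, n)       # first measurement
        retest = true_value + rng.normal(0, noise_sd, n)     # second measurement
        estimates.append(ccc(test, retest))
    return np.percentile(estimates, [2.5, 50, 97.5])

for n in (20, 50, 200):
    lo, median, hi = ccc_interval(n)
    print(f"n={n:<4d} median CCC={median:.2f}  95% interval=({lo:.2f}, {hi:.2f})")
```

At n = 20 the same feature can plausibly land on either side of a 0.75 or 0.9 cut-off purely by chance, which is exactly the fragility described here.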

 

The broader radiomics pipeline introduces additional variability. Preprocessing and normalisation steps can materially shift feature importance, so two pipelines that perform similarly may attribute success to different descriptors. This is a manifestation of the Rashomon effect, where distinct models achieve comparable accuracy through divergent internal mechanisms. In such circumstances, privileging a narrow subset of features based on a single stability criterion may be misguided, especially when radiomic features are highly correlated due to filter banks and transformations. The net effect is that univariable assessments risk missing multi-feature interactions that make a model useful even when some individual components are unstable.  

 

Comparison with contemporary deep learning practice is instructive. Classic engineered texture and wavelet features that underpin radiomics were central to earlier image analysis but have largely been supplanted in many tasks by learned, data-specific representations. Modern interpretability work in neural networks often interrogates image regions rather than handcrafted feature vectors. Parallel directions in radiomics, such as habitat-level analyses and feature maps, may provide more informative pathways for understanding what drives prediction without overcommitting to the stability of single descriptors.  

 

Implications for Model Development and Validation 

The experimental results were obtained on internal test splits with the same distribution as training data, which leaves external generalisability uncertain. Nevertheless, convergent evidence from an external study in a related context is notable: a model trained with all features outperformed a model restricted to reproducible features on external datasets, even though the reproducibility-filtered model showed strong generalisation. This suggests that less stable textural features can add meaningful signal that may be lost by aggressive filtering. The pattern is not uniform across problems, which underscores the need for broader empirical evaluation across more datasets and tumour types.  

 

A practical takeaway is that model selection should prioritise predictive performance and clinical relevance before revisiting feature stability, not the other way round. Given the independence between predictiveness and reproducibility, a strong model justifies further analysis of which features matter and how robustly they behave across acquisition settings and segmentations. By the same logic, interpreting a weak model is of limited value, since explanations of nonpredictive systems are unreliable. The field would benefit from larger, multi-centre collections to assess when filtering by stability helps, when it harms, and how to balance robustness with information retention. Until such resources are available, blanket exclusion of non-reproducible features is difficult to defend. 

 

Treating feature stability as a gatekeeper for clinical modelling risks discarding useful information and underestimating the role of feature interactions. Evidence from simulated test–retest experiments shows that nonreproducible features can contribute to accurate classification and that performance varies across tasks. Choices of metric, threshold and sample size materially influence which features are deemed reproducible, while pipeline variability and correlated descriptors complicate attribution. A more reliable approach is to prioritise predictive performance and clinical relevance, then interrogate stability within models that demonstrably work. In radiomics, the signal may lie not in any single feature but in how many imperfect pieces fit together to illuminate the whole. 

 

Source: European Radiology Experimental 

 

Image Credit: iStock

 

 


References:

Demircioğlu A (2025) Rethinking feature reproducibility in radiomics: the elephant in the dark. Eur Radiol Exp, 9:85. 


