Researchers in the United States trained a convolutional neural network (CNN), a type of deep learning model, to detect signs of pneumonia in chest X-rays from three large U.S. hospital systems, individually and in combination. Overall, the researchers found that the CNN models performed significantly worse on data from hospital systems not represented in training than on held-out data from the systems they were trained on.
Eric Karl Oermann of the Department of Neurological Surgery, Icahn School of Medicine at Mount Sinai, New York, and co-researchers evaluated the trained models' performance internally (on held-out test data from the training hospital systems) and externally (on test data from a different hospital system).
The research team reported these key findings:
The CNN trained on data from Mount Sinai Hospital (MSH; 42,396 radiographs from 12,904 patients) had an area under the receiver operating characteristic curve (AUC) of 0.802 (95% confidence interval [CI] 0.793-0.812) in held-out internal validation; an AUC of 0.717 (95% CI 0.687-0.746) on data from the National Institutes of Health Clinical Center (NIH; 112,120 radiographs from 30,805 patients); and an AUC of 0.756 (95% CI 0.674-0.838) on data from the Indiana University Network for Patient Care (IU; 3,807 radiographs from 3,683 patients).
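To make the AUC comparison above concrete, the following Python sketch shows one way such figures can be computed: a rank-based AUC with a percentile bootstrap 95% CI, applied to an "internal" and an "external" test set. It is an illustration only. The article does not say how the study computed its intervals, so the bootstrap here is an assumed stand-in. The function names (auc, bootstrap_ci, simulate_cohort), the synthetic Gaussian scores, and all parameter values are hypothetical and are not the study's data or code.

```python
import random


def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation; ties share an average rank."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative case")
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Items i..j-1 occupy 1-based ranks i+1..j; give each the average rank.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed method, not the study's)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[k] for k in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one of the classes
            stats.append(auc(ys, [scores[k] for k in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def simulate_cohort(n, prevalence, separation, seed):
    """Hypothetical test set: classifier scores are Gaussian, shifted for positives."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    scores = [rng.gauss(separation * y, 1.0) for y in labels]
    return labels, scores


if __name__ == "__main__":
    # Made-up settings: the model separates classes less well on the external site.
    cohorts = {
        "internal": simulate_cohort(2000, 0.3, 1.2, seed=1),
        "external": simulate_cohort(2000, 0.3, 0.8, seed=3),
    }
    for name, (labels, scores) in cohorts.items():
        low, high = bootstrap_ci(labels, scores, n_boot=200, seed=2)
        print(f"{name}: AUC={auc(labels, scores):.3f} (95% CI {low:.3f}-{high:.3f})")
```

This sketch resamples images independently. The datasets described above contain several radiographs per patient, so a more careful analysis would probably resample at the patient level; that refinement is omitted here for brevity.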
Because of limitations of the datasets, and because the features and interrelationships by which CNNs predict outcomes are not easily reduced to simpler, familiar terms, Oermann and co-authors could not fully assess what factors other than disease prevalence might have led to reduced performance in external validation. The authors say that characteristics of the hospital system and department may have acted as confounders.
Nonetheless, the study provides evidence that estimates of real-world CNN performance based on held-out internal test data can be overly optimistic.