Osteoporotic vertebral fractures are a major complication of osteoporosis and are frequently overlooked on routine computed tomography (CT) scans. Accurate identification and grading are clinically important, as early detection can influence patient management and treatment decisions. However, differentiating true osteoporotic fractures from non-fracture vertebral height loss or degenerative alterations remains challenging, particularly for less experienced readers. Deep learning approaches have been proposed to support fracture detection by learning complex image patterns from large datasets. A retrospective diagnostic accuracy evaluation compared four deep learning models, one CE-marked commercial algorithm and eight human raters with varying levels of experience for the detection and grading of osteoporotic vertebral compression fractures on routine CT scans. Performance was assessed at vertebral, regional and patient levels using the semiquantitative Genant scale.

Study Design and Cohort Characteristics

The evaluation included 3548 thoracic and lumbar vertebrae from 331 patients derived from the public Vertebral Segmentation 19 and 20 datasets. Vertebral fractures were graded according to the Genant scale, distinguishing between any fracture and moderate or severe fractures considered clinically most relevant. Reference standard readings were established by consensus between an attending neuroradiologist and an experienced neuroradiology resident, informed by clinical information.
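
The article does not state the Genant thresholds. As a minimal sketch, assuming the commonly cited semiquantitative cut-offs (roughly 20 to 25% height loss for mild, 25 to 40% for moderate, more than 40% for severe), the two endpoints described above could be derived from a grade as follows. The function names and the exact handling of boundary values are hypothetical, and in practice Genant grading is a visual assessment rather than a pure measurement.

```python
def genant_grade(height_loss_percent: float) -> int:
    """Map vertebral height loss (%) to a Genant grade from 0 to 3.

    Thresholds are the commonly cited cut-offs and are an assumption here;
    they are not taken from the article.
    """
    if height_loss_percent < 20:
        return 0  # no fracture
    if height_loss_percent < 25:
        return 1  # mild
    if height_loss_percent <= 40:
        return 2  # moderate
    return 3      # severe


def is_any_fracture(grade: int) -> bool:
    """Endpoint 1: any fracture (grade 1 or higher)."""
    return grade >= 1


def is_moderate_or_severe(grade: int) -> bool:
    """Endpoint 2: moderate or severe fracture (grade 2 or higher)."""
    return grade >= 2
```

Under these assumptions, a vertebra with 22% height loss counts towards the "any fracture" endpoint but not towards the "moderate or severe" endpoint.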

Eight human raters participated, comprising two attendings, three residents and three final-year medical students. Four deep learning architectures were trained on an internal CT fracture dataset using five-fold cross-validation and evaluated as ensemble models. In addition, a CE-marked commercial deep learning algorithm, SpineQ v1.1, was assessed. Analyses were conducted at vertebral level across the whole spine and within upper thoracic, lower thoracic and lumbar subsets, as well as at patient level.

Among the analysed vertebrae, 190 showed any osteoporotic fracture and 139 were graded as moderate or severe. Fracture prevalence increased from the upper thoracic to the lumbar spine. At the patient level, 85 patients had at least one mild or higher-grade fracture and 74 had at least one moderate or severe fracture.
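
To put these counts in proportion, the short calculation below converts them to prevalences using only the totals reported above (3548 vertebrae, 331 patients). The mild-only count assumes that the 139 moderate or severe vertebrae are a subset of the 190 with any fracture, which the wording suggests but does not state outright.

```python
N_VERTEBRAE = 3548
N_PATIENTS = 331


def percent(count: int, total: int) -> float:
    """Return count / total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


vertebra_any = percent(190, N_VERTEBRAE)         # 5.4
vertebra_mod_severe = percent(139, N_VERTEBRAE)  # 3.9
patient_any = percent(85, N_PATIENTS)            # 25.7
patient_mod_severe = percent(74, N_PATIENTS)     # 22.4

# Assumption: moderate/severe vertebrae are counted within "any fracture".
mild_only_vertebrae = 190 - 139                  # 51
```

Roughly one vertebra in twenty and one patient in four had a fracture of any grade.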

Vertebral-Level Performance Across the Spine

For detection of any fracture at the vertebral level, area under the receiver operating characteristic curve values were comparable for residents, attendings and SpineQ, lower for the deep learning ensembles and lowest for students. Diagnostic accuracy was highest for SpineQ, followed by the deep learning models and residents, with attendings slightly lower and students markedly lower. In generalized linear mixed model comparisons, SpineQ performed significantly better than all other groups. Attendings performed significantly better than the deep learning ensembles, whereas students performed significantly worse than all other groups.
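
The metrics named in this comparison are all derived from the same four confusion-matrix counts. The sketch below shows the standard definitions; the function name and the example counts are hypothetical and are not results from the study.

```python
import math


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard diagnostic accuracy metrics from confusion-matrix counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),  # fractured vertebrae correctly flagged
        "specificity": _ratio(tn, tn + fp),  # intact vertebrae correctly cleared
        "ppv": _ratio(tp, tp + fp),          # flagged vertebrae that are truly fractured
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical counts for one reader, for illustration only.
example = diagnostic_metrics(tp=40, fp=10, fn=10, tn=940)
```

With these made-up counts, sensitivity and positive predictive value are both 0.80 while accuracy is 0.98, which shows why the article reports several metrics side by side rather than accuracy alone.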

In the identification of moderate or severe fractures, SpineQ again achieved the highest area under the curve and the highest accuracy, the latter exceeding 0.99. Attendings showed similar area under the curve values but lower specificity. Deep learning models and residents demonstrated comparable accuracy, while students remained lowest across metrics. Statistical modelling confirmed SpineQ’s superiority over deep learning ensembles and all human reader groups.

Agreement analyses demonstrated almost perfect concordance with the reference standard for SpineQ and high agreement for deep learning models, residents and attendings. Even in challenging vertebrae that required consensus in the reference reading, both deep learning ensembles and SpineQ maintained high overall accuracy, although performance was lower than in cases with initial reader agreement.

Regional and Patient-Level Comparisons

Performance varied across anatomical regions. In the upper thoracic spine, deep learning models achieved the highest area under the curve for detecting any fracture, while SpineQ achieved the highest accuracy and sensitivity. In the lower thoracic and lumbar regions, attendings and residents achieved the highest area under the curve values for any fracture detection, but SpineQ consistently achieved the highest accuracy and frequently the highest sensitivity. Students consistently demonstrated the lowest performance across all regions and tasks.

For detection of moderate or severe fractures within regional subsets, SpineQ achieved the highest area under the curve in the upper and lower thoracic spine and remained among the top performers in the lumbar region, comparable to attendings. Across most regional analyses, generalized linear mixed models confirmed a statistically significant advantage for SpineQ over deep learning ensembles and student raters, and in several comparisons over attendings and residents.

At the patient level, attendings, residents and SpineQ showed comparable area under the curve values for detecting any fracture, with SpineQ achieving the highest overall accuracy and sensitivity. Attendings demonstrated the highest specificity and positive predictive value. For moderate or severe fractures, SpineQ achieved the highest area under the curve and specificity, while residents showed the highest sensitivity. Statistical comparisons indicated that deep learning ensembles performed significantly worse than SpineQ for moderate or severe fracture detection, and students consistently underperformed relative to all other groups.
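
Patient-level labels follow from vertebral grades by the "at least one" rule described earlier: a patient is positive if any of their vertebrae reaches the grade threshold. The sketch below illustrates that aggregation; the function name, the data layout and the example patients are hypothetical.

```python
from collections import defaultdict


def patient_level_labels(vertebra_grades, min_grade):
    """Aggregate (patient_id, genant_grade) pairs into one boolean per patient.

    A patient is positive when at least one vertebra has grade >= min_grade.
    """
    per_patient = defaultdict(bool)
    for patient_id, grade in vertebra_grades:
        per_patient[patient_id] = per_patient[patient_id] or grade >= min_grade
    return dict(per_patient)


# Hypothetical vertebral readings, for illustration only.
readings = [
    ("p1", 0), ("p1", 1),   # one mild fracture
    ("p2", 0), ("p2", 0),   # no fracture
    ("p3", 2), ("p3", 3),   # moderate and severe fractures
]

any_fracture = patient_level_labels(readings, min_grade=1)
moderate_or_severe = patient_level_labels(readings, min_grade=2)
```

In this toy example, patient p1 is positive for any fracture but negative for moderate or severe fracture, which is why the two patient-level counts in the cohort (85 and 74) differ.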

Advanced deep learning algorithms trained specifically for vertebral fracture detection demonstrated performance comparable to radiology residents and, in selected analyses, comparable to attending radiologists. The evaluated CE-marked commercial algorithm consistently achieved the highest diagnostic accuracy and frequently outperformed both deep learning ensembles and human raters across vertebral, regional and patient levels. Performance differences were most pronounced for less experienced readers. These findings indicate that dedicated deep learning tools can achieve expert-level performance in detecting and grading osteoporotic vertebral compression fractures on routine CT scans, supporting their integration into clinical workflows.

Source: European Radiology

References:

Riedel EO, Schinz D, Keicher M et al. (2026) Diagnostic accuracy of deep learning vs. human raters for detecting osteoporotic vertebral compression fractures in routine CT scans. Eur Radiol: In Press.



