Radiology departments face mounting diagnostic challenges, especially in light of increased patient volumes and radiologist shortages. Conventional radiography remains the initial modality of choice for fracture assessment, given its accessibility and cost-effectiveness. However, inherent limitations—such as modest sensitivity in certain anatomical areas and interobserver variability—mean that a significant proportion of fractures can be missed.
Artificial intelligence offers the promise of augmenting radiologist performance and reducing diagnostic errors, but many models have yet to be externally validated in real-world, demographically diverse settings. A recent study evaluated the performance of an AI tool, initially developed and trained on radiographs from Indian medical centres, when applied to a Dutch clinical cohort. The aim was to determine the tool's classification and localisation capabilities across multiple anatomical sites using real-world clinical data.
Validation Methodology and AI Tool Design
The retrospective study was conducted at Erasmus MC in Rotterdam and included radiographs acquired between January 2019 and November 2022. Patients were eligible if they were aged 18 or older and had undergone radiographic imaging of the appendicular skeleton. One radiograph per exam was randomly selected to ensure consistent representation of common projections. Exclusions included radiographs with surgical implants, bone tumours, calcifications near the fracture site or incomplete views. These exclusions aligned with the limitations of the AI tool’s training data.
The evaluated AI model was trained on over 1.5 million radiographs sourced from Indian hospitals and tested on an additional 200,000 images. It employed a multitask deep neural network capable of both classification (fracture detection) and localisation (fracture bounding box). The training dataset spanned 17 anatomical regions of the appendicular skeleton and was annotated by a panel of radiologists. Discrepancies were resolved through adjudication, ensuring robust training standards.
The Dutch dataset used for validation consisted of 14,311 radiographs after applying inclusion and exclusion criteria. Radiology reports served as the reference standard. Each report was converted into bounding boxes for localisation assessment. Annotators followed a strict protocol and were blinded to AI outputs. A musculoskeletal radiologist reviewed uncertain cases and a random sample to ensure consistency.
Performance Across Classification and Localisation Tasks
Patient-wise evaluation showed that the AI tool achieved a sensitivity and a specificity of 87.1% each, with an area under the receiver operating characteristic curve (AUC) of 0.92. This strong performance was consistent across multiple anatomical regions. Clavicle, femur and hip radiographs yielded the highest classification results, with AUCs up to 0.96. Lower performance was noted in areas such as ribs, hand and fingers, and foot and toe, reflecting the challenges associated with complex or small anatomical structures.
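For readers less familiar with these metrics, the following is a minimal, generic sketch of how patient-wise sensitivity, specificity and AUC are computed from binary labels and model scores. It is an illustration of the standard definitions, not the study's actual evaluation code; the example labels and scores are invented.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP).

    y_true / y_pred are lists of 0 (no fracture) and 1 (fracture)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC via its rank interpretation: the probability that a randomly
    chosen fracture case scores higher than a randomly chosen normal case
    (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical mini-cohort: 3 fracture patients, 3 without fracture.
labels = [1, 1, 1, 0, 0, 0]
model_scores = [0.95, 0.80, 0.40, 0.60, 0.20, 0.10]
decisions = [1 if s >= 0.5 else 0 for s in model_scores]

sens, spec = sensitivity_specificity(labels, decisions)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, AUC={auc(labels, model_scores):.2f}")
```

In the study, these patient-wise numbers (87.1% sensitivity and specificity, AUC 0.92) were computed over the full validation set rather than per anatomical region alone.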
Fracture-wise analysis assessed the AI's ability to localise fractures accurately. Out of 3875 annotated fractures classified as acute or sub-acute, the AI correctly localised 60%. An Intersection over Union (IoU) threshold of 0.10 was used to determine successful localisation. The highest localisation performance was for clavicle fractures (90%), followed by femur and humerus fractures. Ribs, pelvis, and foot and toe fractures showed the lowest localisation rates, with ribs at just 7%. These differences highlight the impact of anatomical complexity and fracture visibility on localisation accuracy. While some bounding boxes with low IoU still had clinical value, the variability underscores the need for refined training and annotation strategies.
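The IoU criterion behind the localisation analysis can be sketched in a few lines: IoU is the overlap area of the predicted and reference bounding boxes divided by the area of their union, and a prediction counts as a hit when IoU ≥ 0.10. This is a generic sketch of the standard metric, not the study's code; the box coordinates are invented.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as
    (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Coordinates of the intersection rectangle (empty if boxes are disjoint).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


# Hypothetical reference annotation vs AI prediction (pixel coordinates):
reference = (0, 0, 10, 10)
predicted = (5, 5, 15, 15)
score = iou(reference, predicted)  # intersection 25, union 175, ≈ 0.14
print(f"IoU={score:.2f}, hit={score >= 0.10}")
```

The deliberately low 0.10 threshold reflects that even a rough box can usefully direct a radiologist's attention, which is also why some low-IoU predictions were still judged clinically valuable.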
Clinical Implications and Study Limitations
The study confirms the AI tool’s ability to generalise beyond its original training population, validating its use in a Western-European setting for fracture detection. However, localisation performance did not match classification performance. The variability across anatomical sites may limit its application in polytrauma cases, where accurate detection at multiple sites is essential. Although promising as a triage aid to flag likely fractures and streamline radiology workflows, further refinement is needed for widespread clinical adoption.
Several limitations were noted. The retrospective design and single-centre data collection may limit the generalisability of findings. The use of only one projection per case likely reduced the model’s potential to detect fractures not visible in certain views. Additionally, the reference standard relied solely on routine radiology reports, without confirmatory imaging such as CT or MRI, potentially affecting the accuracy of ground truth annotations. Fractures treated or healed at the time of imaging were not included in the localisation analysis, although they were counted in patient-wise classification.
Annotation differences also presented challenges. Even slight discrepancies between the reference bounding boxes and AI outputs could lead to low IoU scores, despite clinically acceptable localisation. This highlights the importance of establishing standardised annotation protocols and evaluation thresholds for AI in fracture detection. Finally, while 21% of cases underwent expert review, including all uncertain cases, the potential for bias remains given the focus on diagnostically challenging scenarios.
The external validation of a fracture-detection AI tool trained on Indian data demonstrates strong classification performance in a Dutch clinical setting. Patient-wise sensitivity, specificity and AUC indicate the tool’s reliability in identifying fractures across a broad set of anatomical regions. However, localisation performance remains moderate and inconsistent, particularly in anatomically complex areas. These findings support the tool’s potential use as a triage support system, helping radiologists prioritise suspected fractures. Future research should aim to improve localisation precision, explore the impact of AI on reporting efficiency and validate performance in more diverse and complex clinical scenarios, including polytrauma cases and multi-view radiography.
Source: Insights into Imaging
Image Credit: iStock