Accurate preoperative classification of bone tumours as benign or malignant supports timely treatment selection and better outcomes, yet interpretation of radiographs can be challenging, particularly for less experienced clinicians. A machine learning approach integrating radiomics features from knee X-ray images with routine clinical data was built to aid decision-making. Its diagnostic value was assessed in a multireader multicase design using an independent cohort and a standardised statistical framework. Results indicate strong stand-alone model performance and measurable gains when radiologists review cases with model assistance, with the largest improvements observed among junior readers. Transparent feature attribution further clarifies the drivers of the model’s predictions, supporting clinical trust and potential adoption.
Rigorous Design and Model Development
Retrospective radiographic and clinical data from 433 patients with knee bone tumours, collected between October 2006 and June 2023, formed the development set. Regions of interest were manually outlined on anteroposterior and lateral knee radiographs, and 488 radiomics features were then extracted and normalised. Redundant and irrelevant variables were removed using least absolute shrinkage and selection operator (LASSO) regression, yielding a compact feature set. Two families of models were constructed across six classifiers: radiomics-only models and combined models that added clinical variables including erythrocyte sedimentation rate, age, gender and movement disorders. Hyperparameters were tuned by 10-fold cross-validation, and area under the receiver operating characteristic curve (AUC) on a held-out test set guided model selection.
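As an illustration of the normalise-then-select step, the following minimal sketch standardises a feature table and applies LASSO by coordinate descent. It is not the study's pipeline: the data are synthetic, the penalty `lam=0.12` is an arbitrary choice, the binary label is treated as a continuous regression target, and every function name is hypothetical.

```python
import random

def zscore_columns(X):
    """Standardise each column to zero mean and unit variance."""
    n, p = len(X), len(X[0])
    out = [row[:] for row in X]
    for j in range(p):
        col = [row[j] for row in X]
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5 or 1.0
        for i in range(n):
            out[i][j] = (X[i][j] - mean) / sd
    return out

def soft_threshold(value, lam):
    """Shrink value towards zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = y[:]  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            # Covariance of feature j with the residual that ignores feature j
            rho = sum(col[i] * (residual[i] + col[i] * w[j]) for i in range(n)) / n
            norm = sum(c * c for c in col) / n
            new_w = soft_threshold(rho, lam) / norm if norm > 0 else 0.0
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                w[j] = new_w
    return w

# Synthetic stand-in for a radiomics table (assumption: 10 features,
# of which only features 0 and 3 carry signal about the label).
rng = random.Random(0)
n_cases, n_features = 400, 10
raw = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_cases)]
labels = [1.0 if 2 * row[0] - 2 * row[3] + rng.gauss(0, 0.5) > 0 else 0.0
          for row in raw]

X = zscore_columns(raw)
mean_label = sum(labels) / n_cases
y = [v - mean_label for v in labels]  # centre the target so no intercept is needed

weights = lasso_coordinate_descent(X, y, lam=0.12)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("selected feature indices:", selected)
```

Features whose coefficient is shrunk exactly to zero are dropped; in practice the penalty strength would itself be chosen by cross-validation rather than fixed by hand.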
Both radiomics-only and combined strategies achieved strong discrimination, with AUC values exceeding 0.80 across multiple algorithms. The best performance was observed with an extreme gradient boosting combined model that incorporated 16 radiomics features and two clinical parameters, reaching an AUC of 0.905 (95% confidence interval 0.841 to 0.949). Comparative analysis by DeLong’s test favoured the combined approach over radiomics alone for several classifiers, including a statistically significant AUC difference of 0.072 for extreme gradient boosting. These findings underscore the added value of integrating readily available clinical data with texture-based image descriptors to enhance classification of knee bone tumours.
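For readers less familiar with how an AUC and its confidence interval are obtained, here is a minimal sketch that computes the AUC as a rank statistic and attaches a percentile bootstrap interval. The source does not state how its interval was derived, so the bootstrap is an assumption made for illustration, and the scores and labels below are invented.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        # Skip resamples that happen to contain only one class
        if 0 < sum(resampled_labels) < n:
            stats.append(auc([scores[i] for i in idx], resampled_labels))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Invented model outputs: 1 = malignant, 0 = benign
demo_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
demo_scores = [0.10, 0.22, 0.35, 0.41, 0.58, 0.30,
               0.45, 0.62, 0.71, 0.80, 0.93, 0.38]

point_estimate = auc(demo_scores, demo_labels)
ci_lower, ci_upper = bootstrap_auc_ci(demo_scores, demo_labels)
print(f"AUC {point_estimate:.3f}, 95% CI {ci_lower:.3f} to {ci_upper:.3f}")
```

Comparing two correlated AUCs measured on the same cases, as DeLong's test does, additionally requires the covariance between the two estimates, which this sketch does not attempt.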
Model interpretability was addressed using SHAP analysis to rank feature contributions. Texture-derived attributes were influential, with a grey level dependence matrix metric, DependenceNonUniformity, carrying the highest weight among all predictors, while erythrocyte sedimentation rate emerged as an important clinical contributor. Grey level co-occurrence matrix descriptors also provided predictive signal. The explanation plots illustrated how feature values shifted output probabilities towards benign or malignant classes in representative cases, offering insight into the model’s behaviour at the point of care.
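The idea behind SHAP can be shown with a toy example. The sketch below computes exact Shapley values by enumerating feature coalitions for a made-up three-feature risk score; it does not use the SHAP library or the study's model, and the scoring function, feature names and values are all hypothetical. Features outside a coalition are set to a baseline value, which is one common convention among several.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    p = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else baseline[j] for j in range(p)]
        return predict(z)

    phi = [0.0] * p
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Weight of a coalition of this size in the Shapley average
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi

def toy_risk_score(z):
    """Hypothetical score: texture, ESR, their interaction; feature 2 is ignored."""
    texture, esr, _unused = z
    return 0.5 * texture + 0.2 * esr + 0.3 * texture * esr

case = [1.0, 2.0, 5.0]        # invented feature values for one lesion
reference = [0.0, 0.0, 0.0]   # invented baseline ("average") case

contributions = shapley_values(toy_risk_score, case, reference)
for name, phi in zip(["texture", "esr", "unused"], contributions):
    print(f"{name}: {phi:+.3f}")
```

The contributions add up to the difference between the score for this case and the score for the baseline, and a feature the model ignores receives zero. Those two properties are what make such plots readable as "how far each feature pushed the prediction".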
Reader Performance and Assistive Gains
A separate cohort of 168 patients with knee bone tumours, collected between July 2023 and May 2025, was reserved for the multireader multicase evaluation. Seven radiologists with 2 to 11 years of musculoskeletal imaging experience completed a structured training module and a qualification assessment before participating. Each reader first performed independent interpretation of radiographs alongside general clinical information to classify lesions as benign or malignant and record confidence levels. After a four-week washout, the same readers re-interpreted the cases with access to the combined model’s output and feature-level SHAP summaries. Case order was randomised for both sessions. Diagnostic performance was analysed using the Obuchowski-Rockette method, treating readers and cases as random effects to support generalisability.
Independent reading showed strong average performance, with a mean AUC of 0.904 across all seven readers. Readers with more than eight years of experience achieved AUCs exceeding 0.95 without assistance, highlighting an advantage linked to seniority. The combined model’s AUC of 0.905 was similar to the average of all radiologists and surpassed the performance of the less experienced readers whose independent AUCs were 0.812 and 0.815. With model assistance, average performance rose to an AUC of 0.941, reflecting a mean increase of 0.037 with a 95% confidence interval from 0.001 to 0.074 and a P value of 0.047. The largest gains were concentrated among junior readers, where AUC increases of 0.104 and 0.081 were observed, both statistically significant. Improvements among senior readers were smaller in magnitude, consistent with high baseline accuracy. Variance component analysis supported these conclusions and reinforced the robustness of the modality effect across the reader set.
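The reported summary figures can be checked against one another with simple arithmetic. The sketch below uses only numbers quoted above; the variable names are illustrative, and it does not reproduce the Obuchowski-Rockette analysis itself.

```python
# Figures as quoted in the text
unaided_mean_auc = 0.904
aided_mean_auc = 0.941
ci_low, ci_high = 0.001, 0.074

# Gain in reader-averaged AUC, rounded to the precision reported
gain = round(aided_mean_auc - unaided_mean_auc, 3)

# A symmetric interval should be centred close to the point estimate
ci_midpoint = (ci_low + ci_high) / 2
ci_half_width = (ci_high - ci_low) / 2

# An interval that excludes zero corresponds to P < 0.05 at the 95% level
excludes_zero = ci_low > 0

print(f"gain {gain:.3f}, CI midpoint {ci_midpoint:.4f}, "
      f"half-width {ci_half_width:.4f}, excludes zero: {excludes_zero}")
```

The interval is centred within rounding of the 0.037 gain and its lower bound sits just above zero, which is consistent with a P value only slightly below 0.05.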
Feature Contributions and Practical Implications
Feature attribution results delineate how image texture and a standard inflammatory marker jointly inform the model’s classification. High-impact texture measures, led by DependenceNonUniformity from the grey level dependence matrix family, contributed materially to discrimination, suggesting that heterogeneity patterns within lesions captured on routine radiographs align with biological differences between benign and malignant tumours. Erythrocyte sedimentation rate added complementary clinical context, with higher values associated with malignancy in group comparisons. Grey level co-occurrence matrix features also carried notable weight, providing additional textural perspectives that refine predictions. Presenting these elements to readers during the assisted session offered not only a probability estimate but also an explanation of the main drivers, which can support confidence, inform reporting and facilitate discussions in multidisciplinary settings.
Limitations temper interpretation and indicate priorities for further work. Manual segmentation underpinned radiomics extraction, pointing to the potential benefit of automated detection and contouring to reduce observer dependence and streamline workflow. Although the multireader study used an independent cohort assembled over a different time window, all data originated from a single institution. External validation across multiple centres and a broader spectrum of bone tumour types would strengthen generalisability and confirm performance in diverse practice environments.
An interpretable machine learning model that integrates radiomics from knee X-ray images with routine clinical information demonstrated strong discriminatory performance for classifying bone tumours and delivered measurable gains when used as an assistive tool in a multireader setting. Average AUC improved with model guidance, with the most pronounced benefit among junior radiologists, while senior readers maintained high accuracy. Feature-level explanations highlighted the importance of texture metrics and erythrocyte sedimentation rate, offering transparent rationale for predicted classes. These results support the potential for model-assisted reading to enhance diagnostic consistency in musculoskeletal oncology, particularly in settings with limited specialist expertise, while outlining clear avenues for automation and multicentre validation to facilitate broader clinical adoption.
Source: Academic Radiology