Early identification of skin cancer depends on efficient selection of lesions for expert review, yet full-body assessment is time-consuming and often constrained by workforce capacity. Artificial intelligence has improved decision support for clinician-selected dermoscopic images, but that approach does not mirror real-world triage, where every visible lesion must be considered.
Three-dimensional total-body photography captures the entire skin surface under standardised conditions and enables automated prioritisation at scale. An international benchmarking effort evaluated machine learning models that use these images to flag lesions for further assessment and examined which inputs most influence performance. The findings indicate that automated triage can reduce the number of lesions needing evaluation at high sensitivity while clarifying dependencies on imaging workflow, metadata and patient-level context that affect transferability and adoption.
Evaluating Automated Detection at Scale
The benchmarking assessed triage across all lesions detected on full-body images rather than pre-selected dermoscopic targets. Training data comprised about 400,000 lesion tiles from roughly 1000 patients across centres in North America, Europe and Australia, with test sets compiled from the same and additional sources. Tiles were 15 mm by 15 mm crops centred on software-identified lesions and were accompanied by basic descriptors and appearance measurements produced within the imaging workflow.
On the private evaluation set containing 370,704 lesions from 935 patients, 342 lesions were malignant, including 99 melanomas (55 in situ), 190 basal cell carcinomas and 53 squamous cell carcinomas. The leading approach achieved an area under the receiver operating characteristic curve of 0.9668 for overall skin cancer classification. Precision was expressed as the number of lesions requiring triage per malignancy detected at a given sensitivity. At 80% sensitivity the number was 51.57, and at 90% sensitivity it was 98.20. A secondary metric simulated clinical workflow from a patient perspective by selecting the fifteen highest-scoring lesions per individual. Under this scenario, sensitivity reached 0.7903 for patients harbouring cancer. For melanoma specifically, top systems achieved an area under the curve of 0.9704 with similar patient-level sensitivity using the same top-fifteen approach. Compared with an earlier model based on regression of a smaller feature set, performance gains were evident, including a reduction from 739.45 to 126.31 lesions per melanoma detected at 80% sensitivity in the evaluation framework, roughly an 83% decrease.
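The two triage metrics described above can be illustrated with a minimal Python sketch. It is not the benchmark's evaluation code: the function names, the input layout and the tie handling are assumptions made for illustration, and the patient-level function reflects one plausible reading of the top-fifteen metric (a patient with cancer counts as caught if any malignant lesion appears among that patient's k highest-scoring lesions).

```python
import math
from collections import defaultdict


def lesions_per_detection(scores, labels, target_sensitivity):
    """Lesions flagged per malignancy found, using the shortest ranked list
    that reaches the target sensitivity. Tied scores are not treated specially."""
    total_malignant = sum(labels)
    if total_malignant == 0:
        raise ValueError("at least one malignant lesion is required")
    # Malignancies that must be caught; the epsilon guards against
    # floating-point noise in the product.
    needed = max(1, math.ceil(target_sensitivity * total_malignant - 1e-9))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = 0
    for flagged, (_, label) in enumerate(ranked, start=1):
        found += label
        if found >= needed:
            return flagged / found
    raise ValueError("target sensitivity must not exceed 1")


def top_k_patient_sensitivity(records, k=15):
    """Share of patients with cancer who have at least one malignant lesion
    among their k highest-scoring lesions.
    records: iterable of (patient_id, score, label) with label 1 = malignant."""
    by_patient = defaultdict(list)
    for patient_id, score, label in records:
        by_patient[patient_id].append((score, label))
    with_cancer = 0
    caught = 0
    for lesions in by_patient.values():
        if not any(label for _, label in lesions):
            continue  # patients without cancer do not enter the denominator
        with_cancer += 1
        top = sorted(lesions, key=lambda pair: pair[0], reverse=True)[:k]
        caught += any(label for _, label in top)
    if with_cancer == 0:
        raise ValueError("no patient has a malignant lesion")
    return caught / with_cancer
```

Under this definition the first function counts every flagged lesion, malignant ones included, so subtracting one gives the benign lesions flagged per cancer detected.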
Role of Context and Metadata in Model Performance
Ablation analysis examined four information classes: tile images, basic metadata (age, sex, anatomical site, hospital), appearance measurements from the imaging workflow (examples include lesion area, colour contrast and border characteristics) and patient-context features that place each lesion in relation to others on the same individual. Patient context was a major contributor. Removing these features reduced the area under the curve from 0.967 to 0.956 and increased the average number of benign lesions flagged per cancer detected at 80% sensitivity from 50.57 to 72.68, indicating the value of accounting for within-patient norms.
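As a hedged illustration of what a patient-context feature might look like, the sketch below expresses one lesion measurement as a z-score relative to all lesions on the same patient. The article does not specify which context features the leading models used, so the z-score choice, the function name and the dictionary keys (`patient_id`, `area_mm2`) are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def within_patient_zscores(lesions, feature):
    """Return one z-score per lesion, computed against the mean and
    population standard deviation of that feature on the same patient."""
    values_by_patient = defaultdict(list)
    for lesion in lesions:
        values_by_patient[lesion["patient_id"]].append(lesion[feature])
    stats = {
        patient_id: (mean(values), pstdev(values))
        for patient_id, values in values_by_patient.items()
    }
    zscores = []
    for lesion in lesions:
        mu, sigma = stats[lesion["patient_id"]]
        # With a single lesion or identical values there is no spread,
        # so the lesion is treated as typical for that patient.
        zscores.append((lesion[feature] - mu) / sigma if sigma > 0 else 0.0)
    return zscores


# Hypothetical example data: three lesions on one patient, one on another.
example = [
    {"patient_id": "p1", "area_mm2": 2.0},
    {"patient_id": "p1", "area_mm2": 4.0},
    {"patient_id": "p1", "area_mm2": 6.0},
    {"patient_id": "p2", "area_mm2": 5.0},
]
example_z = within_patient_zscores(example, "area_mm2")
```

A feature of this kind lets a model judge a lesion against that individual's own baseline rather than against the whole population, which is the idea behind accounting for within-patient norms.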
Appearance measurements were more informative than images alone. A variant restricted to those measurements outperformed a tile-only variant (area under the curve 0.939 vs 0.922), and excluding the measurements from the full model was more detrimental than excluding tiles (0.948 vs 0.957). Adding basic metadata to appearance measurements and context further improved discrimination (0.957 vs 0.949). Together, these results support a multimodal strategy that mirrors clinical reasoning by combining visual patterns with structured information.
Exploratory correlations across high-ranking submissions provided additional interpretability. Colour characteristics showed the strongest associations with modelled risk. Redder hue within the lesion was linked to higher scores, and greater redness in the surrounding skin was also associated with higher risk. Lower blue-yellow contrast between lesion and background, higher colour variance and greater colour asymmetry within the lesion were positively associated with risk estimates. Measures of size, including minor axis diameter, lesion area and border perimeter, showed mild positive correlations. By contrast, border irregularity and shape asymmetry exhibited weak associations. The dataset included non-melanocytic lesions, and observed colour patterns were not confined to pigment-heavy presentations.
Practical Limits, Transferability and Efficiency
Despite encouraging accuracy and precision, practical considerations temper immediate translation. Key appearance measurements used by leading models were generated within a specific three-dimensional total-body photography workflow and rely on proprietary algorithms. This dependency constrains direct application to alternative imaging systems that do not produce comparable features and underscores the need for local validation before deployment. Imaging protocols varied between centres, notably the use of cross-polarised versus white light, which affects lesion visibility and complicates performance comparisons. Hospital labels informed predictions in the evaluation set, indicating that site effects can influence outputs and that recalibration may be necessary when transferring models between settings.
Processing time is another factor for service delivery. Reported inference times for the leading approach were about 70 seconds on a graphics processor and 390 seconds on a central processor per full-body capture. These timings are feasible for research pipelines but may challenge high-volume practice without optimisation. Threshold selection also requires careful consideration. Although the models reduced the number of lesions needing assessment by large margins at high sensitivity, services must balance sensitivity against avoidable workup according to local priorities, casemix and capacity. The analysis highlighted areas for improvement in non-melanocytic lesions where risk patterns may differ from pigment-based cues.
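The reported inference times translate into rough throughput figures. The back-of-envelope sketch below assumes captures are processed one after another on a single device with no loading or queuing overhead, which is a simplification and not something the article states.

```python
def captures_per_hour(seconds_per_capture):
    """Sequential throughput for one device, ignoring any overhead."""
    return 3600 / seconds_per_capture


gpu_rate = captures_per_hour(70)   # about 51 full-body captures per hour
cpu_rate = captures_per_hour(390)  # about 9 full-body captures per hour
```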
Automated triage using three-dimensional total-body photography shows promise for focusing specialist attention on higher-risk lesions and managing large lesion burdens efficiently. Discrimination improved when image data were combined with patient-level context, appearance measurements from the imaging workflow and basic metadata, and when lesion selection was framed at patient level to prioritise a limited set for review. At the same time, reliance on specific imaging outputs, inter-centre variation and processing time requirements present practical constraints. Careful threshold setting, local validation and workflow evaluation will be essential to translate gains in triage precision into practice, while further work to enhance transferability and performance across lesion types can strengthen clinical utility.
Source: npj Digital Medicine