Medical diagnosis models in imaging are often developed as single systems intended to serve all patients. In practice, diagnostic performance can vary across demographic groups when disease presentation differs and when training data are unevenly distributed. Conventional fairness strategies typically attempt to reduce performance gaps between groups within one model, often improving outcomes for underrepresented populations while slightly reducing accuracy for groups already well served. In clinical environments, where small losses in diagnostic reliability may have consequences for care quality, this trade-off remains controversial. An alternative approach reframes fairness as the goal of maximising diagnostic performance for each patient group individually. Rather than redistributing performance, the method uses additional computing resources to support group-specific optimisation while continuing to monitor fairness metrics that capture disparity between groups.

 

Reframing Fairness as Group-Specific Performance

The proposed perspective treats fairness, accuracy and computing resources as interconnected variables. Earlier approaches generally assumed fixed computational capacity and attempted to balance average accuracy and fairness within a shared model. The alternative strategy allocates computational resources to train dedicated models for different groups, aiming to reach the highest achievable performance within each subgroup without reducing accuracy elsewhere. This approach is based on the observation that improving outcomes for lower-performing groups does not inherently require degrading performance for others.

 

Selecting appropriate training data becomes a central challenge in group-specific modelling. Training exclusively on in-group samples may lead to poor performance when subgroup datasets are small, while training on the entire dataset can introduce distributional mismatch or obscure group-specific features. Experiments in dermatology image classification illustrate these dynamics. For lighter skin groups, training on selected subsets of skin types produced stronger performance than training on all available data. Introducing out-of-group samples improved performance initially but reduced it when their proportion became too large. For darker skin groups, models trained solely on dark-skin data performed worse than those trained on lighter-skin data, indicating that larger datasets can sometimes outweigh distributional differences. These findings support selective incorporation of out-of-group data guided by empirical signals rather than uniform inclusion.
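The selection dynamic described above can be sketched as a validation-driven search over how much out-of-group data to mix in, rather than including all data uniformly. The function names and the inverted-U toy curve below are illustrative stand-ins, not the paper's actual procedure:

```python
def best_mix_fraction(fractions, validate):
    """Return the out-of-group fraction whose mixed training set scores
    highest on the target group's validation split."""
    scores = {f: validate(f) for f in fractions}
    return max(scores, key=scores.get)

def toy_validate(frac):
    # Toy validation curve mirroring the reported dynamics: a moderate
    # amount of out-of-group data helps, but too much hurts.
    return 0.70 + 0.3 * frac - 0.5 * frac ** 2

best = best_mix_fraction([0.0, 0.1, 0.2, 0.3, 0.4, 0.5], toy_validate)
print(best)  # fraction at the peak of the toy curve
```

In practice `validate` would train a model on the mixed data and score it on held-out in-group samples, which is far more expensive than this sketch suggests.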

 

A Weighting Method to Use Out-of-Group Data Safely

To support group-specific optimisation, a framework called SPARE introduces sample-level weighting during training. Each training example receives a value between zero and one representing its usefulness for improving performance on a target group. This design allows models to benefit from relevant out-of-group data without introducing harmful distributional effects.
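One plausible way to apply such sample-level weights is to scale each example's loss term during training. The normalised weighted loss below is a generic sketch under that assumption; the paper's exact objective may differ:

```python
def weighted_loss(losses, weights):
    """Average per-sample losses scaled by SPARE-style weights in [0, 1].
    Out-of-group samples assigned low weights contribute little, while
    weight 1.0 means the sample counts fully toward the target group."""
    assert all(0.0 <= w <= 1.0 for w in weights)
    total_w = sum(weights)
    return sum(l * w for l, w in zip(losses, weights)) / max(total_w, 1e-8)
```

A sample with weight zero is effectively excluded, so continuous weighting subsumes hard data selection as a special case.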

 

Must Read: Neural Reconstruction of Pulmonary Segments


 

Two signals determine the weight assigned to each sample. The first is utility, reflecting how informative a sample is for refining diagnostic decision boundaries. Samples located near classification boundaries are considered particularly valuable because they contribute to reducing misclassification risk. The second signal is similarity, representing how closely a sample aligns with the target group distribution. Similarity is estimated using a group label predictor that distinguishes between demographic groups. Samples predicted to belong to the target group are treated as introducing no distributional shift.
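The similarity signal can be illustrated with a small function over the group label predictor's output. The threshold and the linear mapping below are hypothetical choices for the sketch; the article only states that in-group predictions are treated as introducing no distributional shift:

```python
def similarity_distance(p_target_group):
    """Distributional-shift distance from a group label predictor's
    probability that a sample belongs to the target group. Samples
    predicted in-group get distance 0; the distance then grows as the
    predictor becomes more confident the sample is out-of-group."""
    if p_target_group >= 0.5:      # predicted to belong to the target group
        return 0.0
    return 0.5 - p_target_group    # illustrative out-of-group margin
```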

 

Both utility and similarity are expressed as distances to model decision boundaries. These distances are combined using a weighted formulation that balances the two signals. Smaller combined distances correspond to higher training weights, with an exponential mapping used to produce continuous values between zero and one. Distances are estimated through minimal adversarial perturbations, defined as the smallest change to an input required to alter a model prediction. This perturbation-based estimation captures local decision-surface geometry in complex neural networks more effectively than confidence-based heuristics.
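For a linear scorer the minimal adversarial perturbation has a closed form, which makes the distance-to-weight mapping easy to sketch. The `alpha` mix and `temperature` below are assumed hyperparameters, and a deep network would need an iterative perturbation search rather than this exact formula:

```python
import math

def boundary_distance_linear(x, w, b):
    """Minimal adversarial perturbation for a linear scorer w·x + b:
    the smallest input change that flips the sign is |w·x + b| / ||w||."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(score) / norm

def spare_weight(d_utility, d_similarity, alpha=0.5, temperature=1.0):
    """Combine the two boundary distances linearly and map the result to
    a weight in (0, 1] with an exponential; smaller combined distance
    yields a weight closer to 1, as the article describes."""
    d = alpha * d_utility + (1.0 - alpha) * d_similarity
    return math.exp(-d / temperature)
```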

 

Performance Across Datasets and Resource Trade-Offs

Evaluation was conducted on two skin disease classification datasets using different sensitive attributes. Fitzpatrick-17k contains more than 16,000 images representing over 100 skin conditions, with skin types grouped into light and dark categories. ISIC 2019 contains more than 25,000 images across eight diagnostic categories, with age used to define younger and older groups. Images were resized, augmented and divided into training, validation and testing subsets, and models were trained using consistent optimisation settings across methods.
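When subgroups are imbalanced, the split step typically needs to preserve each group's proportion so every group appears in every subset. The sketch below shows one common way to do that; the split ratios are illustrative, as the article does not report the exact fractions used:

```python
import random

def stratified_split(groups, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split sample indices into train/val/test while keeping each
    demographic group's proportion roughly constant across subsets."""
    rng = random.Random(seed)
    by_group = {}
    for idx, g in enumerate(groups):
        by_group.setdefault(g, []).append(idx)
    train, val, test = [], [], []
    for members in by_group.values():
        rng.shuffle(members)
        n = len(members)
        cut1 = int(ratios[0] * n)
        cut2 = cut1 + int(ratios[1] * n)
        train.extend(members[:cut1])
        val.extend(members[cut1:cut2])
        test.extend(members[cut2:])
    return train, val, test
```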

 

Across datasets, group-specific optimisation with SPARE improved classification performance for each demographic group while maintaining competitive fairness indicators. On ISIC 2019, models using the method achieved higher F1 scores for both age groups than a standard convolutional baseline while also reducing disparity metrics such as Equalized Opportunity and Equalized Odds. On Fitzpatrick-17k, the approach improved precision, recall and F1 for both light and dark skin groups compared with bias-mitigation baselines. Performance gains remained consistent when model architectures were changed, including experiments using both ResNet-18 and VGG-11 backbones.
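The reported metrics are straightforward to compute per group. The F1 formula below is standard; the Equalized Odds gap is shown in one common formulation (mean absolute gap in true- and false-positive rates), since the article does not spell out the exact definition used:

```python
def f1(tp, fp, fn):
    """F1 score from raw counts, guarding against empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def equalized_odds_gap(rates_a, rates_b):
    """Disparity between two groups as the mean absolute gap in TPR and
    FPR; lower values indicate more similar error profiles."""
    tpr_gap = abs(rates_a["tpr"] - rates_b["tpr"])
    fpr_gap = abs(rates_a["fpr"] - rates_b["fpr"])
    return 0.5 * (tpr_gap + fpr_gap)
```

Computing F1 separately for each group, as above, is what makes the "no group left behind" comparison against a shared-model baseline possible.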

 

Further analysis examined the contribution of weighting components. Using similarity alone generally produced stronger results than using utility alone, while combining both signals yielded the most consistent improvements across datasets. Tests replacing either signal with alternative measures reduced performance, supporting the combined weighting strategy. Continuous sample-level weighting also outperformed simpler alternatives such as binary selection rules or group-level weighting schemes.

 

Because training separate models for multiple groups increases computational demand, experiments also evaluated partial parameter sharing. Performance improved as more group-specific parameters were introduced, with fully independent models achieving the strongest results. Sharing early network layers provided a compromise in resource-constrained settings, maintaining competitive performance while reducing training cost.
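The sharing scheme can be sketched as a shared trunk feeding group-specific heads, with the split point controlling the cost/accuracy trade-off. The class name and the toy linear stages below are illustrative, not the paper's architecture:

```python
class PartiallySharedModel:
    """Partial parameter sharing: early layers (the 'trunk') are shared
    across demographic groups, later layers (the 'heads') are
    group-specific. Fully independent models are the special case where
    the trunk is the identity and each head is a whole network."""

    def __init__(self, trunk, heads):
        self.trunk = trunk    # callable shared by all groups
        self.heads = heads    # dict mapping group name -> callable

    def predict(self, x, group):
        return self.heads[group](self.trunk(x))

# Toy linear stages standing in for shared and group-specific layers.
model = PartiallySharedModel(
    trunk=lambda x: 2 * x,
    heads={"light": lambda h: h + 1, "dark": lambda h: h - 1},
)
```

Only the heads multiply with the number of groups, which is where the training-cost saving over fully independent models comes from.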

 

Group-specific optimisation offers an alternative path for fairness in medical imaging systems by focusing on maximising diagnostic accuracy for each demographic group rather than redistributing performance within a single model. The SPARE framework implements this idea through sample-level weighting based on utility and similarity signals estimated from decision-boundary proximity. Experiments in dermatology image classification across two datasets demonstrate improved precision, recall and F1 for multiple demographic groups while preserving competitive fairness metrics. Computational cost remains a practical consideration, particularly as demographic categories expand, but parameter-sharing strategies provide potential efficiency gains. The results illustrate how fairness objectives in clinical AI can be pursued through performance optimisation strategies supported by targeted use of training data and computing resources.

 

Source: Medical Image Analysis

 



References:

Xu G, Duana Y, Xia J et al. (2026) Rethinking fairness in medical imaging: Maximizing group-specific performance with application to skin disease diagnosis. Medical Image Analysis; 109: 103950.


