Early and reliable separation of benign, borderline and malignant ovarian tumours guides timely treatment and the choice of surgery. Ultrasound (US) is the first step in most pathways, yet interpretation varies between clinicians and there are no widely accepted criteria for borderline disease. A multi-centre project in China developed a deep learning (DL) assistant that analyses US images and adds brief large language model (LLM) descriptions to help with interpretation. Built on pathologically confirmed cases and tested across different hospitals and scanners, the assistant aimed to support day-to-day decisions rather than replace clinical judgement. In development and independent testing, it raised the accuracy and consistency of primary readers to a level close to that of an expert and performed better than a commonly used risk model.
Data, Training and What the System Delivers
The project brought together US images from several hospitals, with each case matched to a pathological diagnosis. Data from three institutions in one region were used to build and tune the assistant. Images from other hospitals in different regions were kept separate for a final check, ensuring that patients in the test set did not appear in the development data. Each case was labelled benign, borderline or malignant, and pathology served as the reference throughout.
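The article does not detail the tooling behind this separation, but the underlying principle is a patient-level split: every image from a given patient stays on the same side of the partition. The sketch below illustrates that principle with scikit-learn's GroupShuffleSplit and hypothetical column and file names; in the study itself the final test set came from entirely different hospitals rather than a random split.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical table: one row per image, with image_id, patient_id and the
# pathology label (benign / borderline / malignant).
records = pd.read_csv("ovarian_us_images.csv")

# Group by patient so that all images from one patient land on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(records, groups=records["patient_id"]))

train_df, val_df = records.iloc[train_idx], records.iloc[val_idx]

# Sanity check: no patient contributes images to both partitions.
assert set(train_df["patient_id"]).isdisjoint(set(val_df["patient_id"]))
```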
The team evaluated different image-based modelling approaches to classify tumours directly from the scans and selected the option that proved most stable on internal checks. The assistant outputs a probability for each of the three categories so that clinicians can see how confident the system is. To make the result easier to use, an LLM turns those probabilities into short descriptions of what is seen on the image, deliberately stopping short of a diagnostic conclusion. Together these outputs provide clear signals while keeping the final call with the reader. When challenged with images from hospitals not involved in development, performance held up, suggesting the approach can travel across sites and equipment.
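The study's model architecture and prompting are not specified here, so the following is only a minimal sketch of the two output stages described: a softmax over three categories, followed by a prompt asking a language model for a short, non-diagnostic description. The `model` object, the function names and the prompt wording are assumptions for illustration, not the study's actual implementation.

```python
import torch
import torch.nn.functional as F

CLASSES = ["benign", "borderline", "malignant"]

def classify(model: torch.nn.Module, image: torch.Tensor) -> dict[str, float]:
    """Return the model's probability for each category for a single grayscale frame."""
    with torch.no_grad():
        logits = model(image.unsqueeze(0))           # shape: (1, 3)
        probs = F.softmax(logits, dim=1).squeeze(0)  # the three values sum to 1
    return {name: float(p) for name, p in zip(CLASSES, probs)}

def build_description_prompt(probs: dict[str, float]) -> str:
    """Assemble a hypothetical prompt asking an LLM for a non-diagnostic description."""
    ranked = ", ".join(f"{k}: {v:.2f}" for k, v in sorted(probs.items(), key=lambda x: -x[1]))
    return (
        "Describe the sonographic features of this ovarian lesion in two sentences. "
        "Do not state a diagnosis or tumour category. "
        f"Classifier probabilities, for context only: {ranked}."
    )
```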
How Assistance Changed Reader Performance
To understand impact on practice, five primary US doctors and one expert read the same image sets several times. The clinicians first worked unaided, then repeated their assessments with help from the assistant. They focused on separating benign from malignant lesions in line with standard risk stratification, while borderline lesions were left out of this specific reader comparison because agreed ultrasound rules for that category are lacking.
With assistance switched on, every primary reader became more accurate at deciding between benign and malignant and showed higher intra-reader agreement across repeated readings. Agreement across the group also improved. These gains appeared not only where the assistant had been developed but also when images came from other hospitals, a setting closer to real-world variation. In parallel, the image-based approach outperformed a commonly used risk model that relies on a mix of clinical and imaging predictors. While the reader exercise focused on two classes, the assistant itself was trained to recognise three, including borderline, and it maintained that capability on independent data. This matters for surgical planning, as a confident separation between categories can influence the extent of intervention and options for fertility preservation.
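The article does not name the statistics behind intra-reader and inter-reader agreement, but such consistency is commonly summarised with Cohen's kappa for one reader's repeated passes and Fleiss' kappa across several readers. The sketch below shows both on made-up benign (0) / malignant (1) calls; none of the numbers come from the study.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical calls by one reader on the same eight cases, read twice.
reader_pass1 = np.array([0, 1, 1, 0, 1, 0, 0, 1])
reader_pass2 = np.array([0, 1, 1, 0, 1, 1, 0, 1])

# Intra-reader consistency: agreement between the two passes.
intra_kappa = cohen_kappa_score(reader_pass1, reader_pass2)

# Inter-reader consistency: five readers (columns) on five cases (rows),
# again with made-up values.
all_readers = np.array([
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
])
counts, _ = aggregate_raters(all_readers)        # per-case counts of each category
inter_kappa = fleiss_kappa(counts, method="fleiss")

print(f"intra-reader kappa: {intra_kappa:.2f}, inter-reader kappa: {inter_kappa:.2f}")
```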
Interpretability, Scope and Practical Use
Adoption depends on clarity as much as accuracy. The assistant does not deliver a single opaque score. Instead it shows category probabilities and an accompanying plain-language description that mirrors how clinicians talk about images. The description is intentionally non-diagnostic, reinforcing that responsibility for the final decision remains with the professional. This design aims to support trust, reduce friction at the point of care and help align machine outputs with routine reporting.
The project was built on static grayscale US images from multiple vendors. Images were anonymised and device details were not available, so performance could not be broken down by scanner model. The data strategy allowed more than one image per patient when building the assistant, which reflects how multiple stills are captured in clinics. To avoid leakage from this overlap, the independent test set was kept separate at the patient level and drawn from other regions, offering a stronger check of generalisability. The reader exercise was retrospective and excluded borderline lesions, which narrows the comparison but avoids forcing a category without widely accepted ultrasound criteria.
The work also highlights limits that shape next steps. All cases came from one country, and only grayscale stills were used, so moving clips, colour flow and additional metadata were not considered. Even so, the consistent lift for primary readers, together with performance that travelled across hospitals, suggests the assistant can slot into everyday workflows. It is designed to inform rather than overrule and to present its reasoning in a way that is easy to follow.
An ultrasound DL assistant that couples image-based classification with brief LLM descriptions improved the accuracy and consistency of primary readers when separating benign from malignant ovarian tumours and sustained three-class recognition that includes borderline disease. Benefits held both where the system was built and in independent hospitals, and the image-driven approach exceeded a commonly used risk model under the conditions tested. By pairing transparent probabilities with plain-language summaries, the assistant supports clinician reasoning while keeping decisions in expert hands. Within the stated scope and limits, the findings point to a practical route to strengthen ovarian tumour assessment in routine US practice and to a foundation for broader, prospective and multimodal validation.
Source: Insights into Imaging
Image Credit: iStock