Artificial intelligence is becoming increasingly embedded in computer-assisted intervention, supporting preoperative planning, intraoperative guidance and postoperative evaluation. Reliable interpretation of surgical scenes requires understanding both anatomical structures and the sequence of actions unfolding during procedures. Achieving this capability has been challenging because annotated surgical video data remain limited and costly to produce across the wide range of procedures represented in clinical taxonomies such as SNOMED-CT. Self-supervised learning approaches have reduced annotation requirements in medical imaging, yet many surgical models still rely on static image pre-training and treat temporal modelling as a later step, restricting their ability to capture the dynamic context of operative workflows and tissue interaction.

 

Large-Scale Surgical Video Data Integration

A large-scale surgical video corpus was assembled to support joint spatial and temporal representation learning. The dataset contains 3,650 videos and approximately 3.55 million frames covering more than 20 procedures and over 10 anatomical structures. It combines curated public surgical datasets with additional material collected from publicly accessible surgical recordings to expand procedural diversity.

 

One component, SurgPub, integrates eight public surgical video datasets and contributes 274 videos comprising 1.20 million frames. The second component, SurgWeb, adds 3,376 videos with approximately 2.35 million frames sourced from publicly available online surgical recordings. These videos span multiple specialties including hepatobiliary, colorectal, upper gastrointestinal, gynaecologic and urologic laparoscopy. A structured de-identification pipeline removed non-endoscopic views, audio tracks, textual overlays, patient faces and other potentially identifying information. The resulting dataset contains only intraoperative endoscopic footage without identifiable patient data, providing large-scale material suitable for video-level model development.
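
The published pipeline is not described at implementation level; purely as a rough illustration, a single step of such a pipeline, stripping the audio track and cropping a fixed overlay band with ffmpeg, might look like the sketch below. The crop geometry is a hypothetical placeholder, and removing patient faces or burned-in text would additionally require detection models that this sketch does not attempt.

```python
import subprocess
from pathlib import Path

def strip_audio_and_crop(src: Path, dst: Path,
                         crop: str = "in_w:in_h-80:0:0") -> None:
    """Remove the audio track and crop a fixed overlay band from a recording.

    `crop` is an illustrative ffmpeg crop expression that trims an
    80-pixel strip from the bottom of the frame, where textual overlays
    often appear. Face and text redaction are out of scope here.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-an",                  # drop all audio streams
         "-vf", f"crop={crop}",  # crop away the overlay region
         str(dst)],
        check=True,
    )
```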

 

The dataset was designed to support learning across heterogeneous surgical contexts while reducing dependence on manual annotation. By combining curated datasets with broader procedural coverage, it enables representation learning that reflects both anatomical variation and procedural diversity encountered in operating environments.

 

Video-Level Pre-Training Architecture

A surgical video foundation model, SurgVISTA, was developed to learn spatial structure and temporal dynamics simultaneously during pre-training. The model uses a reconstruction-based learning strategy with an asymmetric encoder–decoder architecture. A unified encoder captures joint spatiotemporal dependencies from surgical video clips, while two decoders support complementary learning objectives.

 

One decoder reconstructs masked regions of video sequences, encouraging the model to infer spatial relationships and temporal continuity within surgical activity. The second decoder applies image-level knowledge distillation guided by a surgery-specific expert model to preserve fine anatomical detail and semantic information that may be weakened during temporal abstraction. This dual-objective design supports integration of spatial and temporal learning within a single pre-training framework.
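
A minimal PyTorch-style sketch of this dual-objective setup is shown below; the module names, dimensions and the frozen expert are illustrative assumptions rather than the published SurgVISTA configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualObjectivePretrainer(nn.Module):
    """Shared spatiotemporal encoder feeding two decoders: one that
    reconstructs masked video patches, and one whose pooled features are
    matched to a frozen, surgery-specific image expert (a stand-in here)."""

    def __init__(self, encoder: nn.Module, expert: nn.Module,
                 dim: int = 768, patch_pixels: int = 16 * 16 * 3):
        super().__init__()
        self.encoder = encoder              # joint spatiotemporal transformer
        self.expert = expert.eval()         # frozen image-level teacher
        for p in self.expert.parameters():
            p.requires_grad_(False)
        self.recon_decoder = nn.Linear(dim, patch_pixels)
        self.distill_decoder = nn.Linear(dim, dim)

    def forward(self, clip, token_mask, target_patches):
        # clip: (B, T, C, H, W); token_mask: (B, N) bool, True = masked token
        # target_patches: (B, N, patch_pixels) ground-truth pixel values
        latent = self.encoder(clip, token_mask)        # (B, N, dim)

        # Objective 1: reconstruct the pixels of masked spatiotemporal patches.
        recon = self.recon_decoder(latent)
        loss_recon = F.mse_loss(recon[token_mask], target_patches[token_mask])

        # Objective 2: distil clip-level semantics from the frozen image expert.
        b, t = clip.shape[:2]
        with torch.no_grad():
            teacher = self.expert(clip.flatten(0, 1))     # (B*T, dim)
            teacher = teacher.view(b, t, -1).mean(dim=1)  # (B, dim)
        student = self.distill_decoder(latent).mean(dim=1)  # pool tokens
        loss_distill = F.mse_loss(student, teacher)
        return loss_recon, loss_distill
```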

 

Video clips are constructed through uniform frame sampling and processed using a transformer-based architecture with joint spatiotemporal attention. Masked reconstruction loss and distillation loss are combined during optimisation to guide representation learning. The approach is intended to reduce the separation between pre-training and downstream adaptation by embedding temporal reasoning directly into the learned representations rather than adding temporal modules later.
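
The uniform sampling step can be sketched as follows; the clip length of 16 frames and the loss weighting are assumptions for illustration, as the summary does not state the exact values used.

```python
import numpy as np

def sample_clip_indices(num_frames: int, clip_len: int = 16) -> np.ndarray:
    """Return `clip_len` frame indices spread evenly over a video.

    Each index falls at the centre of one of `clip_len` equal segments,
    so the sampled clip covers the full video at a uniform stride.
    """
    segment = num_frames / clip_len
    return np.floor(segment * (np.arange(clip_len) + 0.5)).astype(int)

# The two objectives are combined with a weighting term (illustrative):
#     loss = loss_recon + lam * loss_distill
```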

 

Performance Across Surgical Understanding Tasks

Evaluation was conducted using a benchmark covering 13 datasets, six surgical procedures and four categories of surgical scene understanding tasks. The benchmark includes more than 500 long-duration surgical videos and over 600 short clips, representing more than 660,000 frames. Tasks include surgical workflow recognition, surgical action recognition, surgical triplet recognition and surgical skill assessment.

 

Comparisons with natural-domain pre-trained models show that video-level pre-training consistently improves workflow recognition performance relative to image-level pre-training. On the Cholec80 dataset, video-level approaches achieved gains of 1.5% in both image-level and video-level accuracy and 2.8% in phase-level Jaccard. Larger improvements were observed on the out-of-domain Cataract-101 dataset, with increases of 10.8% in image-level accuracy, 10.7% in video-level accuracy and 19.6% in phase-level Jaccard.
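
For reference, phase-level Jaccard is the intersection-over-union between predicted and ground-truth frame sets, computed per phase and then averaged over phases; a minimal sketch, assuming integer phase labels per frame, is given below.

```python
import numpy as np

def phase_level_jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-phase Jaccard over frame-wise phase labels.

    For each phase present in the ground truth, compute
    |pred ∩ gt| / |pred ∪ gt| over frames, then average across phases.
    """
    scores = []
    for phase in np.unique(gt):
        p, g = pred == phase, gt == phase
        scores.append(np.logical_and(p, g).sum() / np.logical_or(p, g).sum())
    return float(np.mean(scores))
```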

 

SurgVISTA outperformed natural-domain video-level models across both in-domain and out-of-domain evaluations. On laparoscopic hysterectomy workflow recognition, despite only 1.13% procedural overlap between pre-training and evaluation data, improvements reached 4.7% in image-level accuracy, 4.6% in video-level accuracy and 7.3% in phase-level Jaccard. On an endoscopic submucosal dissection dataset, gains of 2.8%, 2.7% and 5.4% were observed across the same metrics.

 

Additional comparisons with surgical-domain foundation models demonstrated improvements across multiple workflow datasets. On Cholec80, the model achieved 91.5% image-level accuracy and 91.5% video-level accuracy, with phase-level precision of 87.3%, recall of 87.7% and Jaccard of 78.1%. On M2CAI16-Workflow, increases of 1.1% and 1.7% were observed in image-level and video-level accuracy, alongside larger improvements in phase-level precision, recall and Jaccard. Similar performance gains were reported on AutoLaparo and cataract workflow datasets.

 

Scaling experiments showed that performance improved as pre-training data volume increased across nine downstream datasets, including settings with limited procedural overlap. Ablation analysis of the knowledge distillation objective indicated consistent improvements in accuracy and Jaccard metrics across multiple datasets and pre-training configurations.

 

Video-level surgical representation learning that jointly models spatial anatomy and temporal activity enables improved performance across multiple surgical scene understanding tasks. A large and diverse surgical video corpus supports scalable pre-training, while a dual-objective architecture integrates masked video reconstruction with surgery-specific knowledge distillation. Reported results demonstrate consistent improvements across workflow recognition and related tasks, including evaluations involving new procedures and limited data overlap. The framework establishes a foundation for further development of surgical video models and highlights the importance of large-scale datasets, temporal modelling and specialised pre-training strategies for advancing artificial intelligence in computer-assisted intervention.

 

Source: npj Digital Medicine

Image Credit: iStock


References:

Yang S, Zhou F, Mayer L et al. (2026) Large-scale self-supervised video foundation model for intelligent surgery. npj Digit Med: In Press.


