Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation

University of Tübingen, Germany
Teaser figure showing SAM3 vs ReMeDI-SAM3 comparison

SAM3 vs ReMeDI-SAM3 (Ours): The orange-labeled instrument gets occluded after T=153 and re-appears at T=165. While SAM3 produces false positives after its re-appearance, ReMeDI-SAM3 maintains consistent identities across occlusion and re-entry.

Abstract

Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions.

We propose ReMeDI-SAM3, a training-free, memory-enhanced extension of SAM3 that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 in a zero-shot setting show absolute mcIoU improvements of around 7 and 16 percentage points, respectively, over vanilla SAM3, outperforming even prior training-based approaches.

Method Overview

ReMeDI-SAM3 Pipeline

We propose ReMeDI-SAM3 (Refined Memory for Disambiguation of Identities with SAM3), a training-free extension of SAM3 that enhances temporal consistency and identity preservation in surgical videos. The pipeline is shown above. Our approach (1) restructures the SAM3 memory into two components: (i) a relevance-aware memory that admits only high-confidence frames to stabilize temporal propagation, and (ii) an occlusion-aware memory that selectively retains pre-occlusion appearance cues to facilitate identity recovery. Building on this design, we further introduce (2) a novel memory expansion scheme based on piecewise interpolation of temporal positional encodings, which enables a larger memory without retraining, and (3) a feature-based re-identification module with temporal voting for robust post-occlusion identity verification and correction. Together, these components improve robustness to occlusions, tracking stability, and the handling of instrument re-entry in surgical videos.
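The memory expansion in (2) can be illustrated with a minimal sketch. It assumes the model holds K learned temporal positional encodings, one per memory slot, and resamples them to a larger number of slots by linear interpolation between neighbouring learned vectors. The function name, the uniform spacing of the new slots, and the plain-list representation are assumptions made for illustration; the exact piecewise scheme used in ReMeDI-SAM3 may differ.

```python
from typing import List, Sequence


def expand_temporal_encodings(
    learned: Sequence[Sequence[float]], new_size: int
) -> List[List[float]]:
    """Resample K learned encodings to `new_size` slots (hypothetical helper).

    Each new slot is mapped to a fractional position on the original slot
    axis and filled by linear interpolation between the two nearest learned
    encodings, so no new parameters have to be trained.
    """
    k = len(learned)
    if k < 2 or new_size < 2:
        raise ValueError("need at least two learned encodings and two new slots")

    expanded = []
    for j in range(new_size):
        pos = j * (k - 1) / (new_size - 1)  # fractional index into `learned`
        lo = min(int(pos), k - 2)           # left neighbour of the segment
        w = pos - lo                        # weight of the right neighbour
        expanded.append(
            [(1 - w) * a + w * b for a, b in zip(learned[lo], learned[lo + 1])]
        )
    return expanded


# Toy example: three 2-D encodings expanded to five memory slots.
learned = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
print(expand_temporal_encodings(learned, 5))
```

In this toy example the first, middle, and last slots reproduce the learned encodings exactly, and the two new slots fall halfway between their neighbours.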

Results

Our method achieves state-of-the-art results on the EndoVis17 and EndoVis18 benchmarks in the zero-shot setting, outperforming vanilla SAM3 as well as prior training-based approaches.

EndoVis17

Method                 Challenge IoU    IoU      mcIoU
ISINet                 55.62            52.20    28.96
S3Net                  72.54            71.99    46.55
MATIS Frame            68.79            62.74    37.30
TP-SIS                 63.37            63.37    52.74
TrackAnything          67.41            64.50    62.97
SurgicalSAM            69.94            69.94    67.03
SP-SAM                 73.94            73.94    71.06
MA-SAM2 (Zero-Shot)    62.49            62.49    59.89
SAM3 (Zero-Shot)       71.32            71.32    68.79
ReMeDI-SAM3 (Ours)     78.57            78.57    75.65

EndoVis18

Method                 Challenge IoU    IoU      mcIoU
ISINet                 73.03            70.94    40.21
S3Net                  75.81            74.02    42.58
MATIS Frame            82.37            77.01    48.65
TP-SIS                 84.92            83.61    65.44
TrackAnything          65.72            60.88    38.60
SurgicalSAM            80.33            80.33    58.87
SP-SAM                 84.24            84.24    65.71
SAM3 (Zero-Shot)       88.04            81.82    66.46
ReMeDI-SAM3 (Ours)     88.24            87.46    82.23
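
The absolute mcIoU gains over vanilla SAM3 can be read directly off the two tables. The short check below only subtracts the reported values; the dictionary layout and helper name are illustrative.

```python
# mcIoU values copied from the EndoVis17 and EndoVis18 tables.
mciou = {
    "EndoVis17": {"SAM3": 68.79, "ReMeDI-SAM3": 75.65},
    "EndoVis18": {"SAM3": 66.46, "ReMeDI-SAM3": 82.23},
}


def absolute_gain(scores: dict) -> float:
    """Difference in mcIoU (percentage points), rounded to two decimals."""
    return round(scores["ReMeDI-SAM3"] - scores["SAM3"], 2)


gains = {dataset: absolute_gain(scores) for dataset, scores in mciou.items()}
print(gains)  # {'EndoVis17': 6.86, 'EndoVis18': 15.77}
```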
Main results comparison

Qualitative comparison of SAM3 and ReMeDI-SAM3 on a challenging occlusion and reappearance case in EndoVis17. After the orange-labeled instrument becomes fully occluded at T=44, SAM3 exhibits identity drift and incorrectly assigns the orange identity to the visible green instrument, with this mislabeling persisting across subsequent frames. In contrast, ReMeDI-SAM3 suppresses such false-positive identity propagation during the occlusion and correctly re-identifies the true instrument upon reappearance.
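
The suppression behaviour in this example relates to the relevance-aware and occlusion-aware memories. A minimal sketch of such a dual memory follows, assuming each frame comes with a scalar mask confidence and an occlusion flag. The class names, the 0.8 threshold, and the bank sizes are hypothetical and are not taken from the ReMeDI-SAM3 implementation.

```python
from collections import deque
from typing import Deque, NamedTuple


class Frame(NamedTuple):
    index: int
    confidence: float  # assumed per-frame mask confidence in [0, 1]
    occluded: bool     # assumed flag: the target is judged absent


class DualMemory:
    """Toy memory with a relevance-filtered bank and a pre-occlusion bank."""

    def __init__(self, threshold: float = 0.8, capacity: int = 6,
                 occlusion_capacity: int = 2) -> None:
        self.threshold = threshold
        self.relevant: Deque[Frame] = deque(maxlen=capacity)
        self.pre_occlusion: Deque[Frame] = deque(maxlen=occlusion_capacity)
        self._in_occlusion = False

    def update(self, frame: Frame) -> None:
        if frame.occluded:
            # On the first occluded frame, preserve the latest confident view
            # so it is still available when the object returns.
            if not self._in_occlusion and self.relevant:
                self.pre_occlusion.append(self.relevant[-1])
            self._in_occlusion = True
            return  # occluded frames never enter the memory
        self._in_occlusion = False
        if frame.confidence >= self.threshold:
            self.relevant.append(frame)  # admit only high-confidence frames


memory = DualMemory()
stream = [(40, 0.95, False), (41, 0.50, False), (42, 0.90, False),
          (43, 0.85, False), (44, 0.10, True), (45, 0.05, True),
          (46, 0.92, False)]
for item in stream:
    memory.update(Frame(*item))

print([f.index for f in memory.relevant])       # [40, 42, 43, 46]
print([f.index for f in memory.pre_occlusion])  # [43]
```

The low-confidence frame 41 and the occluded frames 44 and 45 are never written to memory, while frame 43 is retained as the last confident view before the occlusion.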

EndoVis17 qualitative results

Qualitative comparison on EndoVis17 showing instrument turnover. The orange instrument (Bipolar Forceps) exits the scene after T=75, and a second, red instrument (Prograsp Forceps) enters later (T=126-132). ReMeDI-SAM3 initially misses the new instrument (T=126) because it is not yet clearly visible, but subsequently recovers and correctly assigns the red identity once sufficient evidence is available. In contrast, SAM3 incorrectly preserves the orange identity after the occlusion and continues to label the new red instrument as orange.
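
The delayed but correct assignment in this example is the kind of behaviour that feature-based re-identification with temporal voting is meant to produce. The sketch below is one possible reading of that idea: each frame votes for the reference identity with the highest cosine similarity, and an identity is committed only after enough votes accumulate. The function names, the similarity and vote thresholds, and the 2-D toy features are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vote_identity(
    track_features: Sequence[Sequence[float]],
    references: Dict[str, Sequence[float]],
    min_similarity: float = 0.5,  # assumed threshold
    min_votes: int = 3,           # assumed number of agreeing frames
) -> Optional[str]:
    """Return the identity that wins the vote, or None if evidence is weak."""
    if not references:
        return None
    votes: Counter = Counter()
    for feature in track_features:
        # Each frame votes for its most similar reference identity.
        best_id = max(references, key=lambda name: cosine(feature, references[name]))
        if cosine(feature, references[best_id]) >= min_similarity:
            votes[best_id] += 1
    if not votes:
        return None
    winner, count = votes.most_common(1)[0]
    return winner if count >= min_votes else None


references = {"bipolar_forceps": [1.0, 0.0], "prograsp_forceps": [0.0, 1.0]}
frames = [[0.6, 0.5], [0.2, 0.9], [0.1, 1.0], [0.0, 0.8]]

print(vote_identity(frames[:2], references))  # None
print(vote_identity(frames, references))      # prograsp_forceps
```

With only two frames the votes are split and no identity is assigned; after four frames the new instrument has collected three consistent votes and is labelled.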

Acknowledgements

The work described in this paper was conducted in the framework of the Graduate School 2543/1 "Intraoperative Multi-Sensory Tissue Differentiation in Oncology" (project ID 40947457), funded by the German Research Foundation (DFG - Deutsche Forschungsgemeinschaft). This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) – EXC number 2064/1 – Project number 390727645. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Valay Bundele and Mehran Hosseinzadeh. We also thank Jan-Niklas Dihlmann for redesigning the pipeline figure.