
Open Access 17.05.2024 | Short communication

Surgical-DeSAM: decoupling SAM for instrument segmentation in robotic surgery

Authors: Yuyang Sheng, Sophia Bano, Matthew J. Clarkson, Mobarakol Islam

Published in: International Journal of Computer Assisted Radiology and Surgery

Abstract

Purpose

The recent segment anything model (SAM) has demonstrated impressive performance with point, text or bounding box prompts in various applications. However, in safety-critical surgical tasks, prompting is not possible because (1) per-frame prompts are not available for supervised learning, (2) prompting frame-by-frame is unrealistic in a real-time tracking application, and (3) annotating prompts for offline applications is expensive.

Methods

We develop Surgical-DeSAM to generate automatic bounding box prompts for decoupling SAM to obtain instrument segmentation in real-time robotic surgery. We utilise a commonly used detection architecture, DETR, and fine-tune it to obtain bounding box prompts for the instruments. We then employ decoupling SAM (DeSAM) by replacing the image encoder with the DETR encoder and fine-tuning the prompt encoder and mask decoder to obtain instance segmentation of the surgical instruments. To improve detection performance, we adopt the Swin-transformer for better feature representation.

Results

The proposed method has been validated on two publicly available datasets from the MICCAI surgical instrument segmentation challenges EndoVis 2017 and 2018. The performance of our method is also compared with SOTA instrument segmentation methods, demonstrating significant improvements with dice scores of 89.62 and 90.70 for EndoVis 2017 and 2018, respectively.

Conclusion

Our extensive experiments and validations demonstrate that Surgical-DeSAM enables real-time instrument segmentation without any additional prompting and outperforms other SOTA segmentation methods.

Introduction

Robot-assisted surgery is gaining increasing attention in the research field of intelligent robots. Several existing works apply deep learning techniques to realise instance segmentation of surgical instruments. While these models have significantly advanced instance segmentation performance on surgical datasets, they have yet to fully harness the capabilities of either the most recent segmentation foundation models or advanced object detection models, which presents an opportunity for further refinement and enhancement. The well-known segmentation foundation model, SAM (segment anything model) [1], and adaptations of SAM for medical image segmentation and surgical instrument segmentation [2] have shown great promise in semantic segmentation. However, they cannot produce object label segmentation, and they require interactive prompting during deployment, which is not realistic.
In this work, we (1) propose Surgical-DeSAM to generate automatic bounding box prompting for a decoupled SAM; (2) design Swin-DETR by replacing ResNet with the Swin-transformer as the image feature extractor of DETR [3]; (3) decouple SAM (DeSAM) by replacing SAM’s image encoder with DETR’s encoder; (4) validate on two publicly available surgical instrument segmentation datasets, EndoVis17 and EndoVis18; and (5) demonstrate its robustness compared to SOTA models.

Methodology

Preliminaries

SAM

SAM [1] is a foundation model for prompt-based image segmentation and is trained on the largest segmentation dataset, with over 1 billion high-quality masks. SAM has a simple transformer-based design composed of a heavyweight image encoder, a prompt encoder, and a lightweight mask decoder. The image encoder directly extracts image features from input images without the need for a separate backbone model, while the lightweight prompt encoder dynamically transforms any given prompt into an embedding vector in real time. These embeddings are then processed by the decoder to generate precise segmentation masks. Prompts can take various forms, including points, boxes, text, or masks. This reliance on prompting limits SAM’s ability to be used directly in real-world applications such as surgical instrument segmentation during surgery, as it is unrealistic to provide a prompt for every frame of a surgical video.
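To make this prompt dependence concrete, the following minimal sketch (illustrative only, not from this work) shows how the released segment-anything package is typically driven with a bounding box prompt for a single frame; the checkpoint path, frame size and box coordinates are placeholder assumptions. Every new frame and every instrument would require a fresh box, which is exactly the limitation addressed here.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM (heavyweight image encoder + prompt encoder + mask decoder).
# The checkpoint path is a placeholder for a downloaded ViT-H checkpoint.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

frame = np.zeros((1024, 1280, 3), dtype=np.uint8)  # stand-in for an RGB surgical frame
predictor.set_image(frame)                          # runs the heavyweight image encoder

# A bounding box prompt (x0, y0, x1, y1) must be supplied per frame and per instrument.
box = np.array([300, 200, 700, 600])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
```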
Table 1
Performance comparison of the proposed Surgical-DeSAM model and the SOTA models on EndoVis 2017 and 2018

| Method (EndoVis 2017)  | mIoU  | DICE  | Method (EndoVis 2018)  | mIoU  | DICE  |
|------------------------|-------|-------|------------------------|-------|-------|
| TernausNet [8]         | 35.27 | –     | TernausNet [8]         | 46.22 | –     |
| MF-TAPNet [9]          | 37.35 | –     | MF-TAPNet [9]          | 67.87 | –     |
| Dual-MF [10]           | 45.80 | 56.12 | Dual-MF [10]           | 70.41 | 76.93 |
| TrackFormer [11]       | 54.91 | 59.72 | TrackFormer [11]       | 71.10 | 77.30 |
| ISINet [7]             | 55.61 | 62.8  | ISINet [7]             | 73.10 | 78.30 |
| TraSeTR [12]           | 60.40 | 65.21 | TraSeTR [12]           | 76.20 | 81.10 |
| S3Net (+MaskRCNN) [13] | 72.54 | –     | S3Net (+MaskRCNN) [13] | 75.81 | –     |
| –                      | –     | –     | SurgicalSAM [14]       | 80.33 | –     |
| Wang et al. [15]       | 71.38 | –     | –                      | –     | –     |
| Surgical-DeSAM         | 82.41 | 89.62 | Surgical-DeSAM         | 84.91 | 90.70 |

Surgical-DeSAM outperforms the compared methods significantly

DETR

DETR (DEtection TRansformer) [3] is a transformer-based object detector. It consists of a CNN backbone, an encoder–decoder transformer and feed-forward networks (FFNs). The CNN backbone is the commonly used ResNet50 [4], which extracts a feature representation (\(\in \Re ^{d\times H\times W}\)) from the input image (\(\in \Re ^{3\times H_0\times W_0}\)). The backbone output, together with spatial positional encodings, is passed to the transformer encoder to produce the encoder memory. The decoder attends to this memory with a set of learned object queries and predicts the class labels and bounding boxes, with centre coordinates, height and width, using FFNs.
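As a rough illustration of this pipeline, the simplified PyTorch sketch below follows the same backbone → transformer → FFN-heads structure; the module sizes, number of queries and head design are illustrative assumptions rather than the exact configuration used in this work, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision


class MiniDETR(nn.Module):
    """Simplified DETR-style detector: CNN backbone -> transformer -> FFN heads."""

    def __init__(self, num_classes: int, d_model: int = 256, num_queries: int = 100):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (B, 2048, H, W)
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)     # project to d
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)         # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)         # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                         # (cx, cy, h, w), normalised

    def forward(self, images: torch.Tensor):          # images: (B, 3, H0, W0)
        f = self.input_proj(self.backbone(images))    # (B, d, H, W)
        seq = f.flatten(2).permute(0, 2, 1)           # (B, H*W, d) token sequence for encoder
        queries = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(seq, queries)           # decoder output: (B, num_queries, d)
        return self.class_head(hs), self.box_head(hs).sigmoid()
```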

Surgical-DeSAM

As shown in Fig. 1, we propose Surgical-DeSAM to automate bounding box prompting by designing (1) Swin-DETR: replacing the ResNet50 backbone of DETR with a Swin-transformer to obtain an efficient model for surgical instrument detection; and (2) decoupled SAM: replacing the SAM image encoder with the DETR encoder and training end-to-end so that the predicted detections prompt SAM's mask decoder to segment the surgical instruments.

SWIN-DETR

DETR utilises ResNet50 as the backbone CNN to extract the feature representation. However, as vision-transformer-based networks show much better performance than CNNs, we replace the backbone network with a recent transformer-based architecture, the Swin-transformer [5], to form our Swin-DETR, as presented in Fig. 1. The Swin-transformer introduces a shifted-window-based hierarchical transformer for greater efficiency in the self-attention computation. It is important to note that the output of the Swin-transformer can be fed directly to the DETR encoder, whereas the ResNet50 feature requires an additional step to collapse its spatial dimensions into one dimension to form a sequence input for the transformer. Overall, Swin-DETR consists of a Swin-transformer to extract the image features, which are then passed to the transformer encoder–decoder and FFNs to obtain the final object class predictions and corresponding bounding boxes. More specifically, the ResNet50 feature map \(f_\textrm{resnet} \in \Re ^{d\times H\times W}\) must be converted into \(f \in \Re ^{d\times HW}\) by collapsing the spatial dimensions, whereas the Swin-transformer directly produces an output feature map of \(f_\textrm{swin} \in \Re ^{d\times HW}\).
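The only interface difference is therefore the shape of the backbone output. The toy tensors below (assumed sizes, not the paper's) illustrate the collapse of ResNet's spatial grid into a token sequence, which a Swin-style backbone can already deliver in sequence form.

```python
import torch

B, d, H, W = 2, 256, 20, 25

# A ResNet50-style feature map keeps a 2-D spatial grid ...
f_resnet = torch.randn(B, d, H, W)
# ... so it must be flattened into a sequence of H*W tokens before the DETR encoder.
f = f_resnet.flatten(2)            # (B, d, H*W)

# A Swin-transformer stage already operates on patch tokens, so its output can be
# taken directly in sequence form with the same target shape.
f_swin = torch.randn(B, d, H * W)  # (B, d, H*W)

assert f.shape == f_swin.shape
```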

Decoupling SAM

As the image encoders of SAM and DETR perform similar feature extraction, we decouple SAM by removing its image encoder and feeding the DETR encoder output directly to the mask decoder. This allows training an end-to-end segmentation model using the DETR-predicted detections as prompts together with a decoupled SAM consisting of the prompt encoder and mask decoder only. During training, we utilise the ground-truth detection bounding boxes and segmentation masks to train both models end-to-end. For the losses, we adopt a box loss \({\mathcal {L}}_\textrm{box}\) combining GIoU [6] and \(l_1\) losses for the detection task, following DETR, and a dice similarity coefficient (DSC) loss \({\mathcal {L}}_\textrm{dsc}\) for the segmentation task. The total loss \(\textrm{Loss}_\textrm{total}\) can therefore be formulated as:
$$\begin{aligned} \textrm{Loss}_\textrm{total} = {\mathcal {L}}_\textrm{box} + {\mathcal {L}}_\textrm{dsc} \end{aligned}$$
(1)
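A minimal sketch of how such a combined objective could be computed in PyTorch is given below, assuming one predicted box and mask per ground-truth instrument and boxes in (x1, y1, x2, y2) format; DETR's Hungarian matching and classification loss are omitted here for brevity, so this is an illustration of Eq. (1) rather than the full training objective.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou


def box_loss(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """L_box: GIoU loss plus l1 loss, following DETR (boxes as (x1, y1, x2, y2))."""
    giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes))  # matched pairs only
    return (1.0 - giou).mean() + F.l1_loss(pred_boxes, gt_boxes)


def dsc_loss(pred_logits: torch.Tensor, gt_masks: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """L_dsc: soft dice loss on the predicted instrument masks."""
    probs = pred_logits.sigmoid().flatten(1)
    gt = gt_masks.flatten(1)
    inter = (probs * gt).sum(dim=1)
    dice = (2 * inter + eps) / (probs.sum(dim=1) + gt.sum(dim=1) + eps)
    return (1.0 - dice).mean()


def total_loss(pred_boxes, gt_boxes, pred_logits, gt_masks) -> torch.Tensor:
    # Eq. (1): Loss_total = L_box + L_dsc
    return box_loss(pred_boxes, gt_boxes) + dsc_loss(pred_logits, gt_masks)
```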

Experiment and results

Dataset

We utilise two benchmark robotic instrument segmentation datasets, EndoVis17 and EndoVis18. Each dataset provides instrument segmentation annotations for multiple video sequences. For EndoVis17, we use the first video sequences, 1 to 8, for training and the remaining sequences for testing. For EndoVis18, following ISINet [7], we use sequences 2, 5, 9 and 15 for testing and the remaining sequences for training.
Table 2
Comparison of DETR and our model with different backbone networks for the detection (mAP) and segmentation (mIoU, DICE) tasks

| Method                     | mAP@0.50:0.95 | mAP@0.50 | mAP@0.75 | mIoU | DICE |
|----------------------------|---------------|----------|----------|------|------|
| DETR-R50                   | 61.4          | 82.6     | 71.3     | –    | –    |
| DETR-SwinB                 | 64.6          | 83.4     | 73.1     | –    | –    |
| Surgical-DeSAM (ResNet50)  | 58.9          | 80.6     | 66.9     | 75.2 | 82.5 |
| Surgical-DeSAM (Swin)      | 61.6          | 83.2     | 71.2     | 82.4 | 89.6 |

Implementation details

We choose the AdamW optimiser with a learning rate of \(10^{-4}\) and weight decay of 0.1 to update the model parameters. The baseline DETR and SAM code is adapted from the official repositories, which use the PyTorch deep learning framework.
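For reference, a sketch of the corresponding optimiser setup and a single training step under the stated hyper-parameters is shown below; `model`, `train_loader` and `total_loss` are placeholders standing in for the end-to-end Swin-DETR plus decoupled-SAM network, the dataset iterator and the loss of Eq. (1), respectively.

```python
import torch

# Placeholders: `model` is the end-to-end detector-segmenter, `train_loader`
# yields (images, gt_boxes, gt_masks), and `total_loss` implements Eq. (1).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

model.train()
for images, gt_boxes, gt_masks in train_loader:
    pred_boxes, pred_mask_logits = model(images)
    loss = total_loss(pred_boxes, gt_boxes, pred_mask_logits, gt_masks)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```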

Results

We conduct experiments on both the object detection and segmentation tasks on the robotic instrument datasets and obtain the instance segmentation performance of our model. Table 1 compares the performance of our model with other SOTA models for robotic instrument instance segmentation on the EndoVis17 and EndoVis18 datasets. Our Surgical-DeSAM outperforms the other SOTA segmentation models on both mIoU and DICE scores. The qualitative visualisation of the predictions is presented in Fig. 2. Our model produces almost no false positives, as it segments the whole instrument based on the bounding box class predicted by Swin-DETR. We also observe high detection performance with Swin-DETR in Table 2, where the predicted bounding boxes are mostly accurate with only slight deviations of the box regions.

Ablation study

To investigate the superiority of the Swin-transformer [5] backbone over ResNet50 [4], we conducted an ablation study on the detection task alone and on the combined detection-prompt and segmentation tasks. In Table 2, the first two rows demonstrate the superior detection performance of DETR-SwinB (DETR with Swin-transformer) compared to DETR-R50 (DETR with ResNet50). The subsequent rows compare Surgical-DeSAM with ResNet50 and Swin-transformer backbones. It is evident that Surgical-DeSAM with a Swin-transformer backbone significantly outperforms Surgical-DeSAM with a ResNet50 backbone, achieving a 2.7% higher mAP in the detection task and a 7.1% higher DICE score in the segmentation task.

Discussion and conclusion

In this paper, we have presented a novel model architecture, Surgical-DeSAM, which decouples SAM to automate bounding box prompting for surgical instrument segmentation. To obtain better feature extraction, we replaced ResNet50 with the Swin-transformer for instrument detection. To automate the bounding box prompting, we decoupled SAM by removing its image encoder and feeding the DETR encoder features and predicted bounding boxes to the SAM prompt encoder and mask decoder to obtain the final segmentation. The experimental results demonstrate the effectiveness of our model in comparison with other state-of-the-art techniques for surgical instrument segmentation. Future work could focus on the robustness and reliability of Surgical-DeSAM-based detection and segmentation.

Acknowledgements

This work was carried out during the MSc Robotics and Computation dissertation project of Yuyang Sheng, Department of Computer Science, University College London. This work was supported in whole, or in part, by the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS) [203145/Z/16/Z] and the Engineering and Physical Sciences Research Council (EPSRC) [EP/W00805X/1, EP/Y01958X/1].

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.
This article does not contain patient data.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


References
1. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y, et al (2023) Segment anything. arXiv preprint arXiv:2304.02643
3. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision, pp 213–229. Springer
4. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
5. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
6. Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 658–666
7. González C, Bravo-Sánchez L, Arbelaez P (2020) ISINet: an instance-based approach for surgical instrument segmentation. In: Conference on medical image computing and computer-assisted intervention, pp 595–605. Springer
8. Iglovikov V, Shvets A (2018) TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation. arXiv preprint arXiv:1801.05746
9. Jin Y, Cheng K, Dou Q, Heng P-A (2019) Incorporating temporal prior from motion flow for instrument segmentation in minimally invasive surgery video. In: Medical image computing and computer assisted intervention – MICCAI 2019: 22nd international conference, Shenzhen, China, Proceedings, Part V 22, pp 440–448. Springer
10. Zhao Z, Jin Y, Gao X, Dou Q, Heng P-A (2020) Learning motion flows for semi-supervised instrument segmentation from robotic surgical video. In: Medical image computing and computer assisted intervention – MICCAI 2020: 23rd international conference, Lima, Peru, Proceedings, Part III 23, pp 679–689. Springer
11. Meinhardt T, Kirillov A, Leal-Taixe L, Feichtenhofer C (2022) TrackFormer: multi-object tracking with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8844–8854
12. Zhao Z, Jin Y, Heng P-A (2022) TraSeTR: track-to-segment transformer with contrastive query for instance-level instrument segmentation in robotic surgery. In: 2022 International conference on robotics and automation (ICRA), pp 11186–11193. IEEE
13. Baby B, et al (2023) From forks to forceps: a new framework for instance segmentation of surgical instruments. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 6191–6201
14. Yue W, Zhang J, Hu K, Xia Y, Luo J, Wang Z (2023) SurgicalSAM: efficient class promptable surgical instrument segmentation. arXiv preprint arXiv:2308.08746
15. Wang A, Islam M, Xu M, Zhang Y, Ren H (2023) SAM meets robotic surgery: an empirical study on generalization, robustness and adaptation. In: Medical image computing and computer assisted intervention – MICCAI 2023 workshops: ISIC 2023, Care-AI 2023, MedAGI 2023, DeCaF 2023, held in conjunction with MICCAI 2023, Vancouver, BC, Canada, Proceedings, pp 234–244. Springer, Berlin, Heidelberg
Metadata
Title
Surgical-DeSAM: decoupling SAM for instrument segmentation in robotic surgery
Authors
Yuyang Sheng
Sophia Bano
Matthew J. Clarkson
Mobarakol Islam
Publication date
17.05.2024
Publisher
Springer International Publishing
Published in
International Journal of Computer Assisted Radiology and Surgery
Print ISSN: 1861-6410
Electronic ISSN: 1861-6429
DOI
https://doi.org/10.1007/s11548-024-03163-6
