Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery

Cited by: 12
Authors
Wu, Jie Ying [1 ]
Tamhane, Aniruddha [1 ]
Kazanzides, Peter [1 ]
Unberath, Mathias [1 ]
Affiliations
[1] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
Keywords
Machine learning; Surgical robotics; Surgical action recognition; Surgical skill recognition
DOI
10.1007/s11548-021-02343-y
CLC number
R318 [Biomedical Engineering]
Discipline code
0831
Abstract
Purpose: Multi- and cross-modal learning consolidates information from multiple data sources and may offer a holistic representation of complex scenarios. Cross-modal learning is particularly interesting because synchronized data streams are immediately useful as self-supervisory signals. The prospect of achieving self-supervised continual learning in surgical robotics is exciting, as it may enable lifelong learning that adapts to different surgeons and cases, ultimately leading to a more general machine understanding of surgical processes.
Methods: We present a learning paradigm using synchronous video and kinematics from robot-mediated surgery. Our approach relies on an encoder-decoder network that maps optical flow to the corresponding kinematics sequence. Clustering on the latent representations reveals meaningful groupings for surgeon gesture and skill level. We demonstrate the generalizability of the representations on the JIGSAWS dataset by classifying skill and gestures on tasks not used for training.
Results: For tasks seen in training, we report 59 to 70% accuracy in surgical gesture classification. On tasks beyond the training setup, we note 45 to 65% accuracy. Qualitatively, we find that unseen gestures form clusters in the latent space of novice actions, which may enable the automatic identification of novel interactions in a lifelong learning scenario.
Conclusion: From predicting the synchronous kinematics sequence, optical-flow representations of surgical scenes emerge that separate well even for new tasks the model has not seen before. While the representations are immediately useful for a variety of tasks, the self-supervised learning paradigm may enable research in lifelong and user-specific learning.
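As a concrete illustration of the cross-modal, self-supervised paradigm summarized in the abstract, the minimal sketch below maps an optical-flow sequence to the synchronized kinematics sequence with a recurrent encoder-decoder and exposes the latent code for downstream clustering or classification. It is not the authors' implementation: the layer types (LSTMs), dimensions, L2 reconstruction loss, and the names OpticalFlowEncoder / KinematicsDecoder are illustrative assumptions; only the flow-to-kinematics mapping and the use of kinematics as the self-supervisory target follow the abstract.

# Hypothetical sketch of the flow-to-kinematics training setup (assumed details,
# not taken from the paper): LSTM encoder/decoder, L2 loss, arbitrary sizes.
import torch
import torch.nn as nn

class OpticalFlowEncoder(nn.Module):
    """Encodes a sequence of flattened optical-flow frames into a latent vector."""
    def __init__(self, flow_dim: int, latent_dim: int = 64):
        super().__init__()
        self.rnn = nn.LSTM(flow_dim, latent_dim, batch_first=True)

    def forward(self, flow_seq):            # flow_seq: (batch, time, flow_dim)
        _, (h_n, _) = self.rnn(flow_seq)
        return h_n[-1]                       # latent code: (batch, latent_dim)

class KinematicsDecoder(nn.Module):
    """Decodes the latent code into the synchronized kinematics sequence."""
    def __init__(self, latent_dim: int, kin_dim: int, seq_len: int):
        super().__init__()
        self.seq_len = seq_len
        self.rnn = nn.LSTM(latent_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, kin_dim)

    def forward(self, latent):
        rep = latent.unsqueeze(1).repeat(1, self.seq_len, 1)  # repeat code per step
        h, _ = self.rnn(rep)
        return self.out(h)                   # (batch, time, kin_dim)

# Assumed shapes: a downsampled 2-channel flow field and JIGSAWS-style
# 76-dimensional kinematics frames over 30-step windows.
encoder = OpticalFlowEncoder(flow_dim=2 * 32 * 32, latent_dim=64)
decoder = KinematicsDecoder(latent_dim=64, kin_dim=76, seq_len=30)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
criterion = nn.MSELoss()

# One self-supervised step: the synchronized kinematics act as the target,
# so no manual gesture or skill labels are required.
flow = torch.randn(8, 30, 2 * 32 * 32)      # dummy optical-flow sequences
kin = torch.randn(8, 30, 76)                 # dummy synchronized kinematics
optimizer.zero_grad()
latent = encoder(flow)
loss = criterion(decoder(latent), kin)
loss.backward()
optimizer.step()

# After training, the latent vectors can be clustered (e.g., k-means) or fed to a
# lightweight classifier to recognize gestures and skill level, including on
# JIGSAWS tasks not seen during training.

The downstream gesture and skill classification reported in the abstract would operate on these latent codes; the sketch only indicates that step in the closing comment.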
Pages: 779-787
Number of pages: 9