An End-to-End Spatial-Temporal Transformer Model for Surgical Action Triplet Recognition

Citations: 0
Authors
Zou, Xiaoyang [1]
Yu, Derong [1]
Tao, Rong [1]
Zheng, Guoyan [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Inst Med Robot, Sch Biomed Engn, Dongchuan Rd, Shanghai, Peoples R China
Source
12TH ASIAN-PACIFIC CONFERENCE ON MEDICAL AND BIOLOGICAL ENGINEERING, VOL 2, APCMBE 2023 | 2024 / Vol. 104
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Surgical action triplet; Transformer; Self-attention; Auxiliary supervision;
DOI
10.1007/978-3-031-51485-2_14
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Surgical activity recognition plays an important role in computer-assisted surgery. Recently, the surgical action triplet has become the representative definition of fine-grained surgical activity; it combines three components in the form ⟨instrument, verb, target⟩. In this work, we propose an end-to-end spatial-temporal transformer model trained with multi-task auxiliary supervision, establishing a powerful baseline for surgical action triplet recognition. Rigorous experiments are conducted on the publicly available CholecT45 dataset for ablation studies and comparisons with state-of-the-art methods. Experimental results show that our method outperforms the state of the art by 6.8%, achieving 36.5% mAP for triplet recognition. Our method also won second place in the action triplet recognition track of the CholecTriplet 2022 Challenge, which further demonstrates its capability.
Pages: 114-120
Page count: 7