An End-to-End Spatial-Temporal Transformer Model for Surgical Action Triplet Recognition

Cited: 0
Authors
Zou, Xiaoyang [1 ]
Yu, Derong [1 ]
Tao, Rong [1 ]
Zheng, Guoyan [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Inst Med Robot, Sch Biomed Engn, Dongchuan Rd, Shanghai, Peoples R China
Source
12TH ASIAN-PACIFIC CONFERENCE ON MEDICAL AND BIOLOGICAL ENGINEERING, VOL 2, APCMBE 2023 | 2024 / Vol. 104
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Surgical action triplet; Transformer; Self-attention; Auxiliary supervision;
DOI
10.1007/978-3-031-51485-2_14
CLC number (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Surgical activity recognition plays an important role in computer-assisted surgery. Recently, the surgical action triplet, a combination of three components in the form of ⟨instrument, verb, target⟩, has become the representative definition of fine-grained surgical activity. In this work, we propose an end-to-end spatial-temporal transformer model trained with multi-task auxiliary supervision, establishing a powerful baseline for surgical action triplet recognition. Rigorous experiments are conducted on the publicly available CholecT45 dataset for ablation studies and comparisons with state-of-the-art methods. Experimental results show that our method outperforms the state of the art by 6.8%, achieving 36.5% mAP for triplet recognition. Our method also took second place in the action triplet recognition track of the CholecTriplet 2022 Challenge, further demonstrating its capability.
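The architecture the abstract describes (per-frame spatial attention, cross-frame temporal attention, and auxiliary per-component heads alongside the joint triplet head) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all module choices, layer counts, and feature dimensions are assumptions; only the class counts (6 instruments, 10 verbs, 15 targets, 100 triplets) follow the CholecT45 label space.

```python
# Hypothetical sketch of a spatial-temporal transformer with multi-task
# auxiliary supervision for surgical action triplet recognition.
import torch
import torch.nn as nn

class SpatialTemporalTripletNet(nn.Module):
    def __init__(self, feat_dim=64, n_instruments=6, n_verbs=10,
                 n_targets=15, n_triplets=100):
        super().__init__()
        # Spatial transformer: self-attention over patch tokens within a frame
        # (a stand-in for features from a CNN/ViT backbone).
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True),
            num_layers=1)
        # Temporal transformer: self-attention over the sequence of frame embeddings.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True),
            num_layers=1)
        # Auxiliary heads for each triplet component plus the joint triplet head;
        # training each head separately provides the multi-task auxiliary supervision.
        self.instrument_head = nn.Linear(feat_dim, n_instruments)
        self.verb_head = nn.Linear(feat_dim, n_verbs)
        self.target_head = nn.Linear(feat_dim, n_targets)
        self.triplet_head = nn.Linear(feat_dim, n_triplets)

    def forward(self, x):
        # x: (batch, n_frames, n_patches, feat_dim)
        b, t, p, d = x.shape
        # Attend over patches within each frame, then pool to a frame embedding.
        frames = self.spatial(x.reshape(b * t, p, d)).mean(dim=1)
        # Attend over frames, then pool to a clip embedding.
        clip = self.temporal(frames.reshape(b, t, d)).mean(dim=1)
        return {
            "instrument": self.instrument_head(clip),
            "verb": self.verb_head(clip),
            "target": self.target_head(clip),
            "triplet": self.triplet_head(clip),
        }

model = SpatialTemporalTripletNet()
out = model(torch.randn(2, 8, 16, 64))  # 2 clips of 8 frames, 16 patches each
print(out["triplet"].shape)  # torch.Size([2, 100])
```

At inference, the triplet head's logits are scored with mAP over the 100 triplet classes; the three auxiliary heads are used only as extra training signals.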
Pages: 114-120
Page count: 7