End-to-End Temporal Action Detection With Transformer

Cited by: 107
Authors
Liu, Xiaolong [1 ]
Wang, Qimeng [1 ]
Hu, Yao [2 ]
Tang, Xu [2 ]
Zhang, Shiwei [3 ]
Bai, Song [4 ]
Bai, Xiang [5 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan 430074, Peoples R China
[2] Alibaba Grp, Beijing 100102, Peoples R China
[3] Alibaba Grp, Hangzhou 311121, Peoples R China
[4] ByteDance Inc, Singapore 048583, Singapore
[5] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Wuhan 430074, Peoples R China
Keywords
Pipelines; Transformers; Proposals; Training; Feature extraction; Task analysis; Detectors; Transformer; temporal action detection; temporal action localization; action recognition
DOI
10.1109/TIP.2022.3195321
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors, while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
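The core module described in the abstract, temporal deformable attention, lets each action query attend to a small set of sampled time points near a reference location instead of the whole video. The following is a minimal illustrative sketch of that idea, not the authors' implementation (see the linked repository for the real code); the function name, the single-head formulation, and the linear-interpolation sampling are simplifying assumptions.

```python
import numpy as np

def temporal_deformable_attention(query, feats, ref_point, w_offset, w_attn):
    """Sketch of single-head temporal deformable attention (hypothetical
    simplification): the query predicts K sampling offsets around a
    normalized reference point, features are linearly interpolated at the
    sampled time points, and combined with softmax attention weights.

    query:     (d,)    one action-query embedding
    feats:     (T, d)  snippet features along the temporal axis
    ref_point: float in [0, 1], normalized reference location
    w_offset:  (d, K)  projects the query to K sampling offsets
    w_attn:    (d, K)  projects the query to K attention logits
    """
    T, d = feats.shape
    offsets = query @ w_offset                # (K,) predicted offsets
    logits = query @ w_attn                   # (K,) attention logits
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax over the K samples

    out = np.zeros(d)
    for k in range(len(offsets)):
        # sampling location in [0, T-1], clipped to the valid range
        loc = np.clip((ref_point + offsets[k]) * (T - 1), 0, T - 1)
        lo, hi = int(np.floor(loc)), int(np.ceil(loc))
        frac = loc - lo
        sample = (1 - frac) * feats[lo] + frac * feats[hi]  # linear interp.
        out += weights[k] * sample
    return out
```

Because attention is computed over only K sampled snippets rather than all T, the cost per query is O(K·d) instead of O(T·d), which is the source of the lower computation cost the abstract claims relative to dense-attention detectors.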
Pages: 5427-5441
Page count: 15
Related Papers
50 records
  • [41] End-to-end point cloud registration with transformer
    Wang, Yong
    Zhou, Pengbo
    Geng, Guohua
    An, Li
    Zhang, Qi
    ARTIFICIAL INTELLIGENCE REVIEW, 2024, 58 (01)
  • [42] Sequential Transformer for End-to-End Person Search
    Chen, Long
    Xu, Jinhua
    NEURAL INFORMATION PROCESSING, ICONIP 2023, PT IV, 2024, 14450 : 226 - 238
  • [43] MulT: An End-to-End Multitask Learning Transformer
    Bhattacharjee, Deblina
    Zhang, Tong
    Suesstrunk, Sabine
    Salzmann, Mathieu
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12021 - 12031
  • [44] End-to-End Video Text Spotting with Transformer
    Wu, Weijia
    Cai, Yuanqiang
    Shen, Chunhua
    Zhang, Debing
    Fu, Ying
    Zhou, Hong
    Luo, Ping
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (09) : 4019 - 4035
  • [45] RESC: REfine the SCore with adaptive transformer head for end-to-end object detection
    Wang, Honglie
    Jiang, Rong
    Xu, Jian
    Sun, Shouqian
    NEURAL COMPUTING & APPLICATIONS, 2022, 34 (14) : 12017 - 12028
  • [46] END-TO-END NETWORK BASED ON TRANSFORMER FOR AUTOMATIC DETECTION OF COVID-19
    Cai, Cong
    Liu, Bin
    Tao, Jianhua
    Tian, Zhengkun
    Lu, Jiahao
    Wang, Kexin
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 9082 - 9086
  • [48] GeometryMotion-Transformer: An End-to-End Framework for 3D Action Recognition
    Liu, Jiaheng
    Guo, Jinyang
    Xu, Dong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5649 - 5661
  • [49] End-to-End Video Object Detection with Spatial-Temporal Transformers
    He, Lu
    Zhou, Qianyu
    Li, Xiangtai
    Niu, Li
    Cheng, Guangliang
    Li, Xiao
    Liu, Wenxuan
    Tong, Yunhai
    Ma, Lizhuang
    Zhang, Liqing
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1507 - 1516
  • [50] Synthetic Temporal Anomaly Guided End-to-End Video Anomaly Detection
    Astrid, Marcella
    Zaheer, Muhammad Zaigham
    Lee, Seung-Ik
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 207 - 214