End-to-End Temporal Action Detection With Transformer

Cited by: 107
Authors
Liu, Xiaolong [1 ]
Wang, Qimeng [1 ]
Hu, Yao [2 ]
Tang, Xu [2 ]
Zhang, Shiwei [3 ]
Bai, Song [4 ]
Bai, Xiang [5 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan 430074, Peoples R China
[2] Alibaba Grp, Beijing 100102, Peoples R China
[3] Alibaba Grp, Hangzhou 311121, Peoples R China
[4] ByteDance Inc, Singapore 048583, Singapore
[5] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Wuhan 430074, Peoples R China
Keywords
Pipelines; Transformers; Proposals; Training; Feature extraction; Task analysis; Detectors; Transformer; temporal action detection; temporal action localization; action recognition;
DOI
10.1109/TIP.2022.3195321
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors, while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
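The core improvement described in the abstract is a temporal deformable attention module: rather than attending densely over all snippets, each action query predicts a few sampling offsets around a reference time and aggregates only those sparse locations. The following is a minimal NumPy sketch of that idea for a single head and a single query; the projection matrices `W_off` and `W_att` stand in for learned linear layers and are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

def temporal_deformable_attention(snippet_feats, query, ref_t, n_points=4, rng=None):
    """Sketch of temporal deformable attention (single head, single query).

    snippet_feats: (T, C) per-snippet video features.
    query:         (C,) action-query embedding.
    ref_t:         reference time in [0, 1] for this query.

    Instead of attending to all T snippets, the query predicts n_points
    fractional offsets around ref_t and attends only to those locations.
    The offset/weight projections here are random stand-ins for learned layers.
    """
    T, C = snippet_feats.shape
    rng = rng or np.random.default_rng(0)
    W_off = rng.standard_normal((C, n_points)) * 0.01  # placeholder for a learned layer
    W_att = rng.standard_normal((C, n_points)) * 0.01  # placeholder for a learned layer

    offsets = query @ W_off                   # (n_points,) fractional time offsets
    logits = query @ W_att
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax attention weights over sample points

    # Sample features at ref_t + offset via linear interpolation on the time axis.
    locs = np.clip((ref_t + offsets) * (T - 1), 0, T - 1)
    lo = np.floor(locs).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = (locs - lo)[:, None]
    sampled = (1 - frac) * snippet_feats[lo] + frac * snippet_feats[hi]  # (n_points, C)

    return weights @ sampled                  # (C,) aggregated temporal context
```

Because only `n_points` locations are sampled per query, the cost per query is O(n_points · C) rather than O(T · C), which is consistent with the lower computation cost the abstract claims for the overall pipeline.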
Pages: 5427-5441
Page count: 15
Related Papers
50 records in total
  • [21] End-to-end temporal attention extraction and human action recognition
    Hong Zhang
    Miao Xin
    Shuhang Wang
    Yifan Yang
    Lei Zhang
    Helong Wang
    Machine Vision and Applications, 2018, 29 : 1127 - 1142
  • [22] End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
    Liu, Shuming
    Zhang, Chen-Lin
    Zhao, Chen
    Ghanem, Bernard
    arXiv, 2023
  • [23] EENED: End-to-End Neural Epilepsy Detection based on Convolutional Transformer
    Liu, Chenyu
    Zhou, Xinliang
    Liu, Yang
    2023 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI, 2023, : 368 - 371
  • [24] RQFormer: Rotated Query Transformer for end-to-end oriented object detection
    Zhao, Jiaqi
    Ding, Zeyu
    Zhou, Yong
    Zhu, Hancheng
    Du, Wen-Liang
    Yao, Rui
    El Saddik, Abdulmotaleb
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 266
  • [25] DefectTR: End-to-end defect detection for sewage networks using a transformer
    Dang, L. Minh
    Wang, Hanxiang
    Li, Yanfen
    Nguyen, Tan N.
    Moon, Hyeonjoon
    CONSTRUCTION AND BUILDING MATERIALS, 2022, 325
  • [26] An End-to-End Transformer Model for 3D Object Detection
    Misra, Ishan
    Girdhar, Rohit
    Joulin, Armand
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2886 - 2897
  • [27] Transformer Generates Conditional Convolution Kernels for End-to-End Lane Detection
    Zhuang, Long
    Jiang, Tiezhen
    Qiu, Meng
    Wang, Anqi
    Huang, Zhixiang
    IEEE SENSORS JOURNAL, 2024, 24 (17) : 28383 - 28396
  • [28] Transformer-based End-to-End Object Detection in Aerial Images
    Vo, Nguyen D.
    Le, Nguyen
    Ngo, Giang
    Doan, Du
    Le, Do
    Nguyen, Khang
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2023, 14 (10) : 1072 - 1079
  • [29] V-DETR: Pure Transformer for End-to-End Object Detection
    Dung Nguyen
    Van-Dung Hoang
    Van-Tuong-Lan Le
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, PT II, ACIIDS 2024, 2024, 14796 : 120 - 131
  • [30] End-to-End Real-Time Vanishing Point Detection with Transformer
    Tong, Xin
    Peng, Shi
    Guo, Yufei
    Huang, Xuhui
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 5243 - 5251