End-to-End Temporal Action Detection With Transformer

Cited by: 107
Authors
Liu, Xiaolong [1 ]
Wang, Qimeng [1 ]
Hu, Yao [2 ]
Tang, Xu [2 ]
Zhang, Shiwei [3 ]
Bai, Song [4 ]
Bai, Xiang [5 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan 430074, Peoples R China
[2] Alibaba Grp, Beijing 100102, Peoples R China
[3] Alibaba Grp, Hangzhou 311121, Peoples R China
[4] ByteDance Inc, Singapore 048583, Singapore
[5] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Wuhan 430074, Peoples R China
Keywords
Pipelines; Transformers; Proposals; Training; Feature extraction; Task analysis; Detectors; Transformer; temporal action detection; temporal action localization; action recognition;
DOI
10.1109/TIP.2022.3195321
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors, while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
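The core improvement described in the abstract is a temporal deformable attention module: rather than attending densely over all snippets, each action query predicts a few sampling offsets around a reference time and aggregates only those sparse locations. The following is a minimal NumPy sketch of that idea for a single head and a single query; the projection matrices `W_off` and `W_att` stand in for learned linear layers and are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

def temporal_deformable_attention(snippet_feats, query, ref_t, n_points=4, rng=None):
    """Sketch of temporal deformable attention (single head, single query).

    snippet_feats: (T, C) per-snippet video features.
    query:         (C,) action-query embedding.
    ref_t:         reference time in [0, 1] for this query.

    Instead of attending to all T snippets, the query predicts n_points
    fractional offsets around ref_t and attends only to those locations.
    The offset/weight projections here are random stand-ins for learned layers.
    """
    T, C = snippet_feats.shape
    rng = rng or np.random.default_rng(0)
    W_off = rng.standard_normal((C, n_points)) * 0.01  # placeholder for a learned layer
    W_att = rng.standard_normal((C, n_points)) * 0.01  # placeholder for a learned layer

    offsets = query @ W_off                   # (n_points,) fractional time offsets
    logits = query @ W_att
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax attention weights over sample points

    # Sample features at ref_t + offset via linear interpolation on the time axis.
    locs = np.clip((ref_t + offsets) * (T - 1), 0, T - 1)
    lo = np.floor(locs).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = (locs - lo)[:, None]
    sampled = (1 - frac) * snippet_feats[lo] + frac * snippet_feats[hi]  # (n_points, C)

    return weights @ sampled                  # (C,) aggregated temporal context
```

Because only `n_points` locations are sampled per query, the cost per query is O(n_points · C) rather than O(T · C), which is consistent with the lower computation cost the abstract claims for the overall pipeline.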
Pages: 5427-5441
Page count: 15
Related Papers
50 records in total
  • [21] End-to-end temporal attention extraction and human action recognition
    Hong Zhang
    Miao Xin
    Shuhang Wang
    Yifan Yang
    Lei Zhang
    Helong Wang
    Machine Vision and Applications, 2018, 29 : 1127 - 1142
  • [22] End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
    Liu, Shuming
    Zhang, Chen-Lin
    Zhao, Chen
    Ghanem, Bernard
    arXiv, 2023
  • [23] EENED: End-to-End Neural Epilepsy Detection based on Convolutional Transformer
    Liu, Chenyu
    Zhou, Xinliang
    Liu, Yang
    2023 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI, 2023, : 368 - 371
  • [24] RQFormer: Rotated Query Transformer for end-to-end oriented object detection
    Zhao, Jiaqi
    Ding, Zeyu
    Zhou, Yong
    Zhu, Hancheng
    Du, Wen-Liang
    Yao, Rui
    El Saddik, Abdulmotaleb
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 266
  • [25] DefectTR: End-to-end defect detection for sewage networks using a transformer
    Dang, L. Minh
    Wang, Hanxiang
    Li, Yanfen
    Nguyen, Tan N.
    Moon, Hyeonjoon
    CONSTRUCTION AND BUILDING MATERIALS, 2022, 325
  • [26] An End-to-End Transformer Model for 3D Object Detection
    Misra, Ishan
    Girdhar, Rohit
    Joulin, Armand
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2886 - 2897
  • [27] Transformer Generates Conditional Convolution Kernels for End-to-End Lane Detection
    Zhuang, Long
    Jiang, Tiezhen
    Qiu, Meng
    Wang, Anqi
    Huang, Zhixiang
    IEEE SENSORS JOURNAL, 2024, 24 (17) : 28383 - 28396
  • [28] Transformer-based End-to-End Object Detection in Aerial Images
    Vo, Nguyen D.
    Le, Nguyen
    Ngo, Giang
    Doan, Du
    Le, Do
    Nguyen, Khang
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2023, 14 (10) : 1072 - 1079
  • [29] V-DETR: Pure Transformer for End-to-End Object Detection
    Dung Nguyen
    Van-Dung Hoang
    Van-Tuong-Lan Le
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, PT II, ACIIDS 2024, 2024, 14796 : 120 - 131
  • [30] End-to-End Real-Time Vanishing Point Detection with Transformer
    Tong, Xin
    Peng, Shi
    Guo, Yufei
    Huang, Xuhui
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 5243 - 5251