Multi-Dimensional Attention With Similarity Constraint for Weakly-Supervised Temporal Action Localization

Cited by: 9
Authors
Chen, Zhengyan [1 ]
Liu, Hong [1 ]
Zhang, Linlin [1 ]
Liao, Xin [2 ]
Affiliations
[1] Peking Univ, Shenzhen Grad Sch, Key Lab Machine Percept, Beijing 100871, Peoples R China
[2] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Videos; Location awareness; Feature extraction; Proposals; Optical flow; Task analysis; Annotations; Multi-dimensional attention; temporal action localization; video analysis; weakly supervised learning;
DOI
10.1109/TMM.2022.3174344
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812 ;
Abstract
Weakly-supervised temporal action localization (WTAL) is a challenging task in understanding untrimmed videos: no frame-wise annotations are provided during training, and only video-level category labels are available. Current methods mainly adopt temporal attention branches to separate foreground from background, with RGB and optical flow features simply concatenated, disregarding discriminative spatial features and the complementarity between modalities. In this work, we propose a Multi-Dimensional Attention (MDA) method that explores attention across three dimensions in weakly supervised action localization: 1) temporal attention, which focuses on segments containing action instances; 2) channel attention, which discovers the cues most relevant to action description; and 3) modal attention, which adaptively fuses RGB and flow information based on feature magnitudes during background modeling. In addition, we introduce a similarity constraint loss that refines action segment representations in feature space, helping the network detect the less discriminative frames of an action and thus capture full action boundaries. The proposed MDA with similarity constraints can be easily applied to existing action detection frameworks with few additional parameters. Extensive experiments on the THUMOS'14 and ActivityNet v1.2 datasets show that the proposed method outperforms current state-of-the-art WTAL approaches and achieves results comparable to some advanced fully-supervised methods.
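The three attention dimensions described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `w_t` and `w_c` are hypothetical learnable projections standing in for the paper's attention branches, and weighting each modality by a softmax over its per-segment L2 norm is an assumption based on the abstract's phrase "fuses RGB and flow information adaptively based on feature magnitudes".

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_dimensional_attention(rgb, flow, w_t, w_c):
    """Fuse RGB/flow segment features (each T x C) with three attentions.

    w_t: (C,) projection producing a temporal (foreground) score per segment.
    w_c: (C, C) projection producing per-channel relevance gates.
    Both are hypothetical parameters standing in for learned branches.
    """
    # Modal attention (assumption): weight each modality per segment by a
    # softmax over the two feature-magnitude (L2 norm) scores.
    mags = np.stack([np.linalg.norm(rgb, axis=1),
                     np.linalg.norm(flow, axis=1)])               # (2, T)
    modal_w = softmax(mags, axis=0)                               # (2, T)
    fused = modal_w[0][:, None] * rgb + modal_w[1][:, None] * flow  # (T, C)

    # Temporal attention: one score per segment, highlighting action segments.
    temp_att = sigmoid(fused @ w_t)                               # (T,)

    # Channel attention: gates over feature channels, shared across time.
    chan_att = sigmoid(fused.mean(axis=0) @ w_c)                  # (C,)

    # Apply both attentions to the modality-fused features.
    return fused * temp_att[:, None] * chan_att[None, :]
```

Because each attention acts along a different axis of the `(T, C, modality)` feature tensor, the mechanism adds only the small projections `w_t` and `w_c` per branch, consistent with the claim of few additional parameters.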
Pages: 4349-4360
Page count: 12