Multipath Attention and Adaptive Gating Network for Video Action Recognition

Cited by: 1
Authors
Zhang, Haiping [1 ]
Hu, Zepeng [1 ]
Yu, Dongjin [1 ]
Guan, Liming [1 ]
Liu, Xu [2 ]
Ma, Conghao [2 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou 310018, Zhejiang, Peoples R China
[2] Hangzhou Dianzi Univ, Sch Elect & Informat, Hangzhou 310018, Zhejiang, Peoples R China
Keywords
Action recognition; Attention mechanism; 3D convolution; Temporal modeling;
DOI
10.1007/s11063-024-11591-3
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
3D CNNs model the temporal structure of existing large action recognition datasets well and have driven substantial progress in RGB-based video action recognition. However, previous 3D CNN models still face several problems. For video feature extraction, the convolutional kernels are typically hand-designed and fixed in each layer of the network, which may not suit the diversity of data in action recognition tasks. In this paper, a new model called the Multipath Attention and Adaptive Gating Network (MAAGN) is proposed. The core idea of MAAGN is to apply a spatial difference module (SDM) and a multi-angle temporal attention module (MTAM) in parallel at each layer of a multipath network to extract spatial and temporal features, respectively, and then to fuse these spatial-temporal features dynamically with an adaptive gating module (AGM). SDM explores the spatial domain of action videos using attention-based difference operators, while MTAM explores the temporal domain from both global and local timing perspectives. AGM is built on an adaptive gate unit whose value is determined by the input to each layer and is therefore unique to that layer, dynamically fusing the spatial and temporal features across the paths of the multipath network. The resulting temporal network, MAAGN, achieves performance competitive with or better than state-of-the-art methods in video action recognition, and exhaustive experiments on several large datasets demonstrate the effectiveness of our approach.
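The abstract does not give implementation details for the adaptive gate unit, so the following is only a minimal PyTorch-style sketch of the input-dependent gating fusion it describes. The module name AdaptiveGate, the channel-squeeze bottleneck, the reduction ratio, and the sigmoid gate that convexly combines the spatial (SDM) and temporal (MTAM) branch outputs are illustrative assumptions, not the authors' exact AGM.

import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    """Illustrative input-dependent gating fusion (a sketch, not the paper's exact AGM).

    A per-layer gate g is computed from the layer input x and used to fuse the
    spatial and temporal branch outputs as: out = g * spatial + (1 - g) * temporal.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)          # squeeze the T, H, W dimensions
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # gate values in (0, 1)
        )

    def forward(self, x, spatial_feat, temporal_feat):
        # x, spatial_feat, temporal_feat: (N, C, T, H, W)
        g = self.fc(self.pool(x).flatten(1))         # (N, C), determined by the layer input
        g = g.view(x.size(0), -1, 1, 1, 1)           # broadcast the gate over T, H, W
        return g * spatial_feat + (1.0 - g) * temporal_feat


# Toy usage: fuse two randomly generated branch outputs for a random clip tensor.
if __name__ == "__main__":
    x = torch.randn(2, 64, 8, 56, 56)                # (batch, channels, frames, height, width)
    gate = AdaptiveGate(channels=64)
    fused = gate(x, spatial_feat=torch.randn_like(x), temporal_feat=torch.randn_like(x))
    print(fused.shape)                               # torch.Size([2, 64, 8, 56, 56])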
Pages: 20