Cascade multi-head attention networks for action recognition

Cited by: 23
Authors
Wang, Jiaze [1 ,2 ]
Peng, Xiaojiang [1 ,2 ]
Qiao, Yu [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, SIAT SenseTime Joint Lab, Shenzhen Key Lab Comp Vis & Pattern Recognit, Shenzhen, Peoples R China
[2] Shenzhen Inst Artificial Intelligence & Robot Soc, SIAT Branch, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Cascade multi-head attention network; Feature aggregation; Visual analysis;
DOI
10.1016/j.cviu.2019.102898
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Long-term temporal information yields crucial cues for video action understanding. Previous work typically relies on sequential models such as recurrent networks, memory units, segmental models, or self-attention mechanisms to integrate local temporal features for long-term temporal modeling. Recurrent or memory networks record temporal patterns (or relations) in memory units, which have proved difficult for capturing long-term information in machine translation. Self-attention mechanisms directly aggregate all local information with attention weights, which is more straightforward and efficient than the former. However, the attention weights from self-attention ignore the relations between local and global information, which may lead to unreliable attention. To this end, we propose a new attention network architecture, termed the Cascade multi-head ATtention Network (CATNet), which constructs video representations with two levels of attention: multi-head local self-attention and relation-based global attention. Starting from the segment features generated by backbone networks, CATNet first learns multiple attention weights for each segment to capture the importance of local features in a self-attention manner. With these local attention weights, CATNet integrates the local features into several global representations, and then learns a second-level attention over the global information in a relation-based manner. Extensive experiments on Kinetics, HMDB51, and UCF101 show that CATNet boosts the baseline network by a large margin. With only RGB information, we achieve 75.8%, 75.2%, and 96.0% on these three datasets respectively, which is comparable or superior to the state of the art.
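As a rough illustration of the two-level aggregation the abstract describes, the PyTorch sketch below applies multi-head local self-attention over segment features, then a relation-based attention over the resulting head-wise global representations. All names, dimensions, and the specific relation scorer (a bilinear score against the mean segment feature) are our own assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeAttention(nn.Module):
    """Sketch of a CATNet-style aggregator: local multi-head self-attention,
    then relation-based attention over the head-wise global features.
    Hypothetical design; the paper's exact formulation may differ."""

    def __init__(self, feat_dim=2048, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        # Level 1: one scoring vector per head, applied to every segment.
        self.local_scorer = nn.Linear(feat_dim, num_heads)
        # Level 2 (assumed): bilinear relation score between each head's
        # global feature and a context vector (the mean segment feature).
        self.relation_scorer = nn.Bilinear(feat_dim, feat_dim, 1)

    def forward(self, segs):
        # segs: (B, T, D) segment features from a backbone network.
        B, T, D = segs.shape
        # Level 1: per-head attention weights over the T segments.
        local_w = F.softmax(self.local_scorer(segs), dim=1)       # (B, T, H)
        # Weighted sums give H global representations per video.
        heads = torch.einsum('bth,btd->bhd', local_w, segs)       # (B, H, D)
        # Level 2: score each head's global feature against the context.
        ctx = segs.mean(dim=1, keepdim=True).expand(-1, self.num_heads, -1)
        rel = self.relation_scorer(heads.reshape(-1, D),
                                   ctx.reshape(-1, D)).view(B, self.num_heads)
        global_w = F.softmax(rel, dim=1)                          # (B, H)
        # Fuse the H global representations into one video descriptor.
        return torch.einsum('bh,bhd->bd', global_w, heads)        # (B, D)

# Example: a batch of 2 videos, each with 8 segments of 2048-D features.
video_repr = CascadeAttention()(torch.randn(2, 8, 2048))  # -> (2, 2048)

The bilinear scorer above is just one plausible way to realize the "relation between local and global information" mentioned in the abstract; any module that relates head-wise summaries to a global context would fit the same cascade structure.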
Pages: 7