Recurrent Region Attention and Video Frame Attention Based Video Action Recognition Network Design

Cited by: 0
Authors
Sang H.-F. [1 ]
Zhao Z.-Y. [1 ]
He D.-K. [2 ]
Affiliations
[1] School of Information Science & Engineering, Shenyang University of Technology, Shenyang, 110870, Liaoning
[2] College of Information Science & Engineering, Northeastern University, Shenyang, 110819, Liaoning
Source
Zhao, Zi-Yu (Maikuraky1022@outlook.com) | Chinese Institute of Electronics, Vol. 48, No. 6, 2020
Keywords
Action recognition; Recurrent neural network; Recurrent region attention; Video frame attention;
DOI
10.3969/j.issn.0372-2112.2020.06.002
CLC Number
TN94 [Television];
Discipline Codes
0810; 081001;
Abstract
In video frames, complex backgrounds, lighting conditions, and other visual information unrelated to the action introduce considerable redundancy and noise into the spatial features of the action, which reduces the accuracy of action recognition to some extent. In view of this, this paper proposes a recurrent region attention cell that captures the visual information of action-related regions in the spatial features and, exploiting the sequential nature of video, builds it into a recurrent region attention model (RRA). Secondly, this paper proposes a video frame attention model (VFA) that highlights the frames most important to the whole action, thereby reducing the interference caused by the similar temporal context shared by video sequences of different actions. Finally, this paper presents an end-to-end trainable network: the recurrent region attention and video frame attention based video action recognition network (RFANet). Experiments on two video action recognition benchmarks, the UCF101 and HMDB51 datasets, show that the proposed RFANet reliably identifies the category of the action in a video. Inspired by the two-stream structure, we further construct a two-modalities RFANet. Under the same training conditions, the two-modalities RFANet achieves the best performance on both datasets. © 2020, Chinese Institute of Electronics. All rights reserved.
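The record does not include the paper's equations, but the two attention stages the abstract describes can be sketched roughly as follows. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the GRU backbone, the additive (concatenate-and-score) attention, the module names RegionAttention and RFANetSketch, and all dimensions (feat_dim, hidden_dim, the region count K) are illustrative choices introduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttention(nn.Module):
    """Weights the K spatial regions of one frame's CNN feature map,
    conditioned on the previous hidden state (the 'recurrent' part of RRA)."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, regions, h):
        # regions: (B, K, feat_dim) -- K = H*W flattened spatial regions
        # h:       (B, hidden_dim)  -- hidden state from the previous step
        h_rep = h.unsqueeze(1).expand(-1, regions.size(1), -1)
        alpha = F.softmax(self.score(torch.cat([regions, h_rep], -1)), dim=1)
        return (alpha * regions).sum(dim=1)                # (B, feat_dim)

class RFANetSketch(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=101):
        super().__init__()
        self.region_att = RegionAttention(feat_dim, hidden_dim)
        self.rnn_cell = nn.GRUCell(feat_dim, hidden_dim)
        self.frame_score = nn.Linear(hidden_dim, 1)        # video frame attention
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):
        # frames: (B, T, K, feat_dim) -- per-frame CNN region features
        B, T, _, _ = frames.shape
        h = frames.new_zeros(B, self.rnn_cell.hidden_size)
        states = []
        for t in range(T):                                 # recurrent region attention
            x = self.region_att(frames[:, t], h)
            h = self.rnn_cell(x, h)
            states.append(h)
        states = torch.stack(states, dim=1)                # (B, T, hidden_dim)
        beta = F.softmax(self.frame_score(states), dim=1)  # weight whole frames
        video = (beta * states).sum(dim=1)                 # attention-pooled video code
        return self.classifier(video)

For example, region features pooled from a CNN backbone (e.g. a 7x7 ResNet feature map flattened to K = 49 regions) could be fed directly: logits = RFANetSketch()(torch.randn(2, 16, 49, 2048)) yields class scores of shape (2, 101).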
Pages: 1052-1061
Number of pages: 10