Metric-Based Attention Feature Learning for Video Action Recognition

Cited by: 10
Authors
Kim, Dae Ha [1 ]
Anvarov, Fazliddin [1 ]
Lee, Jun Min [1 ]
Song, Byung Cheol [1 ]
Affiliation
[1] Inha Univ, Dept Elect & Comp Engn, Incheon 22212, South Korea
Source
IEEE ACCESS | 2021, Vol. 9
Keywords
Feature extraction; Measurement; Three-dimensional displays; Task analysis; Two dimensional displays; Licenses; Kernel; Body action recognition; 3D CNN; attention map learning; distance metric learning;
DOI
10.1109/ACCESS.2021.3064934
Chinese Library Classification (CLC) number
TP [automation technology; computer technology];
Discipline classification code
0812
Abstract
Conventional approaches to video action recognition learn feature maps using 3D convolutional neural networks (CNNs), training on large-scale video datasets to exploit the representation power of 3D CNNs. However, action recognition remains a challenging task: because previous methods rarely distinguish the human body from its environment, they often overfit to background scenes. Note that separating the human body from the background allows the network to learn distinct representations of human action. This paper proposes a novel attention module that focuses only on the action part(s) of a scene while neglecting non-action part(s) such as the background. First, the attention module employs a triplet loss to differentiate active features from non-active or less active ones. Second, two attention modules, operating in the spatial and channel domains, are proposed to enhance feature representation for action recognition: the spatial attention module learns spatial correlations among features, and the channel attention module learns channel correlations. Experimental results show that the proposed method achieves state-of-the-art accuracy of 41.41% and 55.21% on the Diving48 and Something-V1 datasets, respectively. It also delivers competitive performance on UCF-101 and HMDB-51, i.e., 95.83% and 74.33%, respectively.
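The two ingredients the abstract names, a triplet loss that separates active from non-active features and a channel attention module that reweights feature channels, can be illustrated with a minimal NumPy sketch. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the function names, the Euclidean distance metric, and the sigmoid gating over globally pooled channels are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: pull the anchor toward the positive (active)
    feature and push it away from the negative (non-active) feature.
    Returns max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def channel_attention(feat):
    """Channel attention sketch for a 3D-CNN feature map of shape
    (C, T, H, W): global-average-pool each channel, squash the pooled
    descriptors to (0, 1) gates with a sigmoid, and rescale the
    feature map channel-wise."""
    pooled = feat.mean(axis=(1, 2, 3))           # (C,) channel descriptors
    gates = 1.0 / (1.0 + np.exp(-pooled))        # sigmoid gating weights
    return feat * gates[:, None, None, None]     # reweighted feature map
```

In a trained model the gates would come from learned layers (e.g. a small MLP over the pooled descriptors) rather than a bare sigmoid, and the triplet loss would be averaged over mined anchor/positive/negative feature groups; this sketch only shows the shape of the computation.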
Pages: 39218-39228
Number of pages: 11