Learning Attention-Enhanced Spatiotemporal Representation for Action Recognition

Cited by: 12
Authors
Shi, Zhensheng [1 ]
Cao, Liangjie [1 ]
Guan, Cheng [1 ]
Zheng, Haiyong [1 ]
Gu, Zhaorui [1 ]
Yu, Zhibin [1 ]
Zheng, Bing [1 ]
Affiliations
[1] Ocean Univ China, Dept Elect Engn, Qingdao 266100, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; video understanding; spatiotemporal representation; visual attention; 3D-CNN; residual learning;
DOI
10.1109/ACCESS.2020.2968024
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Learning spatiotemporal features via 3D-CNN (3D Convolutional Neural Network) models has been regarded as an effective approach for action recognition. In this paper, we explore the visual attention mechanism for video analysis and propose a novel 3D-CNN model, dubbed AE-I3D (Attention-Enhanced Inflated-3D Network), for learning attention-enhanced spatiotemporal representation. The contribution of our AE-I3D is threefold: First, we inflate soft attention to the spatiotemporal scope for 3D videos, and adopt softmax to generate a probability distribution over attentional features in a feedforward 3D-CNN architecture; Second, we devise an AE-Res (Attention-Enhanced Residual learning) module, which learns attention-enhanced features in a two-branch residual learning manner; the AE-Res module is lightweight and flexible, so that it can be easily embedded into many 3D-CNN architectures; Finally, we embed multiple AE-Res modules into an I3D (Inflated-3D) network, yielding our AE-I3D model, which can be trained in an end-to-end, video-level manner. Different from previous attention networks, our method inflates residual attention from 2D images to 3D videos for 3D attention residual learning to enhance spatiotemporal representation. We use RGB-only video data for evaluation on three benchmarks: UCF101, HMDB51, and Kinetics. The experimental results demonstrate that our AE-I3D is effective and achieves competitive performance.
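The abstract's description of the AE-Res module (a softmax over spatiotemporal attention logits, combined with the features through a residual connection) can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' released code: the function names, the NumPy stand-in for the learned 3D-convolutional attention branch, and the per-channel softmax layout are all assumptions made here for clarity.

```python
import numpy as np

def softmax(x, axis=None):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ae_res_block(features, attention_logits):
    """Sketch of attention-enhanced residual learning.

    features:         (C, T, H, W) spatiotemporal feature map
    attention_logits: (C, T, H, W) raw scores; in the real model these
                      would come from a learned 3D-convolutional branch
    Returns features + attention-weighted features, so the block reduces
    to (scaled) identity behavior when the attention is uninformative.
    """
    c, t, h, w = features.shape
    # Softmax over each channel's spatiotemporal positions turns the
    # logits into a probability distribution, as the abstract describes.
    attn = softmax(attention_logits.reshape(c, -1), axis=1).reshape(c, t, h, w)
    return features + attn * features

# Toy usage: 4 channels, 2 frames, 3x3 spatial grid.
x = np.random.randn(4, 2, 3, 3)
logits = np.random.randn(4, 2, 3, 3)
y = ae_res_block(x, logits)
```

The residual (additive) form is what lets such a module be dropped into an existing 3D-CNN without disturbing its pretrained features, which is consistent with the paper's claim that AE-Res is lightweight and easily embedded.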
Pages: 16785-16794
Number of pages: 10