Learning Attention-Enhanced Spatiotemporal Representation for Action Recognition

Citations: 11
Authors
Shi, Zhensheng [1 ]
Cao, Liangjie [1 ]
Guan, Cheng [1 ]
Zheng, Haiyong [1 ]
Gu, Zhaorui [1 ]
Yu, Zhibin [1 ]
Zheng, Bing [1 ]
Affiliations
[1] Ocean Univ China, Dept Elect Engn, Qingdao 266100, Peoples R China
Source
IEEE ACCESS | 2020, Vol. 8, Issue 08
Funding
National Natural Science Foundation of China
Keywords
Action recognition; video understanding; spatiotemporal representation; visual attention; 3D-CNN; residual learning
DOI
10.1109/ACCESS.2020.2968024
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
Learning spatiotemporal features via 3D-CNN (3D Convolutional Neural Network) models has been regarded as an effective approach for action recognition. In this paper, we explore the visual attention mechanism for video analysis and propose a novel 3D-CNN model, dubbed AE-I3D (Attention-Enhanced Inflated-3D Network), for learning attention-enhanced spatiotemporal representation. The contribution of our AE-I3D is threefold: First, we inflate soft attention to the spatiotemporal scope of 3D videos and adopt softmax to generate a probability distribution over attentional features in a feedforward 3D-CNN architecture; Second, we devise an AE-Res (Attention-Enhanced Residual learning) module, which learns attention-enhanced features in a two-branch residual learning manner; the AE-Res module is lightweight and flexible, so it can be easily embedded into many 3D-CNN architectures; Finally, we embed multiple AE-Res modules into an I3D (Inflated-3D) network, yielding our AE-I3D model, which can be trained in an end-to-end, video-level manner. Unlike previous attention networks, our method inflates residual attention from 2D images to 3D videos for 3D attention residual learning to enhance spatiotemporal representation. We use RGB-only video data for evaluation on three benchmarks: UCF101, HMDB51, and Kinetics. The experimental results demonstrate that our AE-I3D is effective and achieves competitive performance.
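The attention-enhanced residual learning idea in the abstract can be sketched in a few lines. The following is a minimal illustrative NumPy sketch, not the authors' implementation: the learned attention sub-network is replaced by a simple channel average (a stand-in for a learned 1x1x1 convolution), softmax is taken over all spatiotemporal positions to form a probability distribution as described, and the reweighted features are combined with the identity branch by residual addition. The function names and the rescaling choice are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a flat vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def ae_res_block(x):
    """Illustrative sketch of attention-enhanced residual learning.

    x: feature map of shape (C, T, H, W) -- channels, time, height, width.
    Attention branch: pool channels to a (T, H, W) map, apply softmax over
    all spatiotemporal positions, and reweight the input features.
    Identity branch: add the input back (residual connection).
    """
    C, T, H, W = x.shape
    # Channel average as a stand-in for a learned attention sub-network.
    logits = x.mean(axis=0).reshape(-1)            # (T*H*W,)
    attn = softmax(logits).reshape(1, T, H, W)     # probability over positions
    enhanced = x * attn * (T * H * W)              # rescale so attention averages to ~1
    return enhanced + x                            # two-branch residual combination

x = np.random.rand(64, 8, 14, 14).astype(np.float32)
y = ae_res_block(x)
assert y.shape == x.shape  # the module preserves the feature-map shape
```

Because the block preserves the input shape and reduces to (nearly) the identity plus input when attention is uniform, it can be dropped between existing 3D-CNN stages without changing the surrounding architecture, which is the flexibility the abstract claims for AE-Res.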
Pages: 16785-16794 (10 pages)
Related Papers (50 total)
  • [21] Learning Discriminative Feature Representation for Open Set Action Recognition
    Zhang, Hongjie
    Liu, Yi
    Wang, Yali
    Wang, Limin
    Qiao, Yu
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 7696 - 7705
  • [22] A generically Contrastive Spatiotemporal Representation Enhancement for 3D skeleton action recognition
    Zhang, Shaojie
    Yin, Jianqin
    Dang, Yonghao
    PATTERN RECOGNITION, 2025, 164
  • [23] An Effective Video Transformer With Synchronized Spatiotemporal and Spatial Self-Attention for Action Recognition
    Alfasly, Saghir
    Chui, Charles K.
    Jiang, Qingtang
    Lu, Jian
    Xu, Chen
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (02) : 2496 - 2509
  • [24] Supervised Spatial Transformer Networks for Attention Learning in Fine-grained Action Recognition
    Liu, Dichao
    Wang, Yu
    Kato, Jien
    VISAPP: PROCEEDINGS OF THE 14TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS, VOL 4, 2019, : 311 - 318
  • [25] SELFGAIT: A SPATIOTEMPORAL REPRESENTATION LEARNING METHOD FOR SELF-SUPERVISED GAIT RECOGNITION
    Liu, Yiqun
    Zeng, Yi
    Pu, Jian
    Shan, Hongming
    He, Peiyang
    Zhang, Junping
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 2570 - 2574
  • [26] Spatiotemporal Features for Action Recognition and Salient Event Detection
    Rapantzikos, Konstantinos
    Avrithis, Yannis
    Kollias, Stefanos
    COGNITIVE COMPUTATION, 2011, 3 (01) : 167 - 184
  • [27] Deep learning network model based on fusion of spatiotemporal features for action recognition
    Yang, Ge
    Zou, Wu-xing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (07) : 9875 - 9896
  • [28] Augmenting bag-of-words: a robust contextual representation of spatiotemporal interest points for action recognition
    Yang Li
    Junyong Ye
    Tongqing Wang
    Shijian Huang
    The Visual Computer, 2015, 31 : 1383 - 1394