Learning Attention-Enhanced Spatiotemporal Representation for Action Recognition

Citations: 11
Authors
Shi, Zhensheng [1 ]
Cao, Liangjie [1 ]
Guan, Cheng [1 ]
Zheng, Haiyong [1 ]
Gu, Zhaorui [1 ]
Yu, Zhibin [1 ]
Zheng, Bing [1 ]
Affiliations
[1] Ocean Univ China, Dept Elect Engn, Qingdao 266100, Peoples R China
Source
IEEE ACCESS | 2020, Vol. 8, Issue 08
Funding
National Natural Science Foundation of China
Keywords
Action recognition; video understanding; spatiotemporal representation; visual attention; 3D-CNN; residual learning
DOI
10.1109/ACCESS.2020.2968024
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
Learning spatiotemporal features via 3D-CNN (3D Convolutional Neural Network) models has been regarded as an effective approach for action recognition. In this paper, we explore the visual attention mechanism for video analysis and propose a novel 3D-CNN model, dubbed AE-I3D (Attention-Enhanced Inflated-3D Network), for learning attention-enhanced spatiotemporal representation. The contribution of our AE-I3D is threefold: First, we inflate soft attention to the spatiotemporal scope of 3D videos and adopt softmax to generate a probability distribution over attentional features in a feedforward 3D-CNN architecture; Second, we devise an AE-Res (Attention-Enhanced Residual learning) module, which learns attention-enhanced features in a two-branch residual learning manner; the AE-Res module is lightweight and flexible, so it can be easily embedded into many 3D-CNN architectures; Finally, we embed multiple AE-Res modules into an I3D (Inflated-3D) network, yielding our AE-I3D model, which can be trained in an end-to-end, video-level manner. Unlike previous attention networks, our method inflates residual attention from 2D images to 3D videos for 3D attention residual learning to enhance spatiotemporal representation. We use RGB-only video data for evaluation on three benchmarks: UCF101, HMDB51, and Kinetics. The experimental results demonstrate that our AE-I3D is effective and achieves competitive performance.
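The attention-enhanced residual learning idea in the abstract can be sketched in a few lines. The following is a minimal illustrative NumPy sketch, not the authors' implementation: the learned attention sub-network is replaced by a simple channel average (a stand-in for a learned 1x1x1 convolution), softmax is taken over all spatiotemporal positions to form a probability distribution as described, and the reweighted features are combined with the identity branch by residual addition. The function names and the rescaling choice are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a flat vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def ae_res_block(x):
    """Illustrative sketch of attention-enhanced residual learning.

    x: feature map of shape (C, T, H, W) -- channels, time, height, width.
    Attention branch: pool channels to a (T, H, W) map, apply softmax over
    all spatiotemporal positions, and reweight the input features.
    Identity branch: add the input back (residual connection).
    """
    C, T, H, W = x.shape
    # Channel average as a stand-in for a learned attention sub-network.
    logits = x.mean(axis=0).reshape(-1)            # (T*H*W,)
    attn = softmax(logits).reshape(1, T, H, W)     # probability over positions
    enhanced = x * attn * (T * H * W)              # rescale so attention averages to ~1
    return enhanced + x                            # two-branch residual combination

x = np.random.rand(64, 8, 14, 14).astype(np.float32)
y = ae_res_block(x)
assert y.shape == x.shape  # the module preserves the feature-map shape
```

Because the block preserves the input shape and reduces to (nearly) the identity plus input when attention is uniform, it can be dropped between existing 3D-CNN stages without changing the surrounding architecture, which is the flexibility the abstract claims for AE-Res.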
Pages: 16785-16794 (10 pages)
Related Papers (50 total)
  • [21] Learning Discriminative Feature Representation for Open Set Action Recognition
    Zhang, Hongjie
    Liu, Yi
    Wang, Yali
    Wang, Limin
    Qiao, Yu
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 7696 - 7705
  • [22] A generically Contrastive Spatiotemporal Representation Enhancement for 3D skeleton action recognition
    Zhang, Shaojie
    Yin, Jianqin
    Dang, Yonghao
    PATTERN RECOGNITION, 2025, 164
  • [23] An Effective Video Transformer With Synchronized Spatiotemporal and Spatial Self-Attention for Action Recognition
    Alfasly, Saghir
    Chui, Charles K.
    Jiang, Qingtang
    Lu, Jian
    Xu, Chen
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (02) : 2496 - 2509
  • [24] Supervised Spatial Transformer Networks for Attention Learning in Fine-grained Action Recognition
    Liu, Dichao
    Wang, Yu
    Kato, Jien
    VISAPP: PROCEEDINGS OF THE 14TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS, VOL 4, 2019, : 311 - 318
  • [25] SELFGAIT: A SPATIOTEMPORAL REPRESENTATION LEARNING METHOD FOR SELF-SUPERVISED GAIT RECOGNITION
    Liu, Yiqun
    Zeng, Yi
    Pu, Jian
    Shan, Hongming
    He, Peiyang
    Zhang, Junping
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 2570 - 2574
  • [26] Spatiotemporal Features for Action Recognition and Salient Event Detection
    Rapantzikos, Konstantinos
    Avrithis, Yannis
    Kollias, Stefanos
    COGNITIVE COMPUTATION, 2011, 3 (01) : 167 - 184
  • [27] Deep learning network model based on fusion of spatiotemporal features for action recognition
    Yang, Ge
    Zou, Wu-xing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (07) : 9875 - 9896
  • [28] Augmenting bag-of-words: a robust contextual representation of spatiotemporal interest points for action recognition
    Yang Li
    Junyong Ye
    Tongqing Wang
    Shijian Huang
    The Visual Computer, 2015, 31 : 1383 - 1394