Learning Attention-Enhanced Spatiotemporal Representation for Action Recognition

Cited by: 11
Authors
Shi, Zhensheng [1 ]
Cao, Liangjie [1 ]
Guan, Cheng [1 ]
Zheng, Haiyong [1 ]
Gu, Zhaorui [1 ]
Yu, Zhibin [1 ]
Zheng, Bing [1 ]
Affiliations
[1] Ocean Univ China, Dept Elect Engn, Qingdao 266100, Peoples R China
Source
IEEE ACCESS | 2020, Vol. 8, Issue 08
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; video understanding; spatiotemporal representation; visual attention; 3D-CNN; residual learning;
DOI
10.1109/ACCESS.2020.2968024
CLC Number (Chinese Library Classification)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Learning spatiotemporal features via 3D-CNN (3D Convolutional Neural Network) models has been regarded as an effective approach for action recognition. In this paper, we explore the visual attention mechanism for video analysis and propose a novel 3D-CNN model, dubbed AE-I3D (Attention-Enhanced Inflated-3D Network), for learning attention-enhanced spatiotemporal representation. The contribution of our AE-I3D is threefold: First, we inflate soft attention to the spatiotemporal scope of 3D videos and adopt softmax to generate a probability distribution over attentional features in a feedforward 3D-CNN architecture; Second, we devise an AE-Res (Attention-Enhanced Residual learning) module, which learns attention-enhanced features in a two-branch residual learning manner; moreover, the AE-Res module is lightweight and flexible, so it can be easily embedded into many 3D-CNN architectures; Finally, we embed multiple AE-Res modules into an I3D (Inflated-3D) network, yielding our AE-I3D model, which can be trained in an end-to-end, video-level manner. Different from previous attention networks, our method inflates residual attention from 2D images to 3D videos for 3D attention residual learning to enhance spatiotemporal representation. We use RGB-only video data for evaluation on three benchmarks: UCF101, HMDB51, and Kinetics. The experimental results demonstrate that our AE-I3D is effective and achieves competitive performance.
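This record does not include the authors' code. Purely as an illustrative sketch of the two-branch residual attention idea the abstract describes, the following minimal PyTorch module re-weights a 3D (spatiotemporal) feature map with a softmax attention mask and adds it back residually; the class name AEResBlock, the 1x1x1 attention convolution, and the identity-plus-weighted fusion are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class AEResBlock(nn.Module):
    """Hypothetical sketch of an attention-enhanced residual (AE-Res) block.

    Two branches: a trunk that passes features through unchanged, and an
    attention branch whose softmax-normalized spatiotemporal mask re-weights
    the trunk. The re-weighted features are added back residually, so the
    block can be dropped into an existing 3D-CNN such as I3D. Layer choices
    here are illustrative assumptions, not the paper's configuration.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Attention branch: a 1x1x1 3D conv produces one score per
        # spatiotemporal location; softmax over all T*H*W positions then
        # yields a probability distribution, as the abstract describes.
        self.attn_conv = nn.Conv3d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) video feature map
        n, c, t, h, w = x.shape
        scores = self.attn_conv(x).view(n, 1, -1)           # (N, 1, T*H*W)
        mask = torch.softmax(scores, dim=-1).view(n, 1, t, h, w)
        # Residual attention: identity plus attention-weighted features.
        return x + x * mask


if __name__ == "__main__":
    block = AEResBlock(channels=64)
    clip = torch.randn(2, 64, 8, 56, 56)  # batch of 8-frame feature clips
    out = block(clip)
    print(out.shape)                      # torch.Size([2, 64, 8, 56, 56])
```

Because the residual formulation reduces to a near-identity mapping when the attention scores are uniform, such a block could in principle be inserted after existing stages of a 3D backbone without destabilizing it, which is consistent with the abstract's "lightweight and flexible" claim.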
Pages: 16785-16794
Page count: 10
Related Papers
50 records in total
  • [41] Object-ABN: Learning to Generate Sharp Attention Maps for Action Recognition
    Nitta, Tomoya
    Hirakawa, Tsubasa
    Fujiyoshi, Hironobu
    Tamaki, Toru
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2023, E106D (03) : 391 - 400
  • [42] Action recognition and tracking via deep representation extraction and motion bases learning
    Li, Hao-Ting
    Liu, Yung-Pin
    Chang, Yun-Kai
    Chiang, Chen-Kuo
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (09) : 11845 - 11864
  • [43] Learning Self-Correlation in Space and Time as Motion Representation for Action Recognition
    Zhang, Yi
    Li, Yuchang
    Liu, Mingwei
    IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 1747 - 1751
  • [44] JOINT LEARNING ON THE HIERARCHY REPRESENTATION FOR FINE-GRAINED HUMAN ACTION RECOGNITION
    Leong, Mei Chee
    Tan, Hui Li
    Zhang, Haosong
    Li, Liyuan
    Lin, Feng
    Lim, Joo Hwee
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 1059 - 1063
  • [45] Dynamic Representation Learning for Video Action Recognition Using Temporal Residual Networks
    Kong, Yongqiang
    Huang, Jianhui
    Huang, Shanshan
    Wei, Zhengang
    Wang, Shengke
    2018 IEEE SMARTWORLD, UBIQUITOUS INTELLIGENCE & COMPUTING, ADVANCED & TRUSTED COMPUTING, SCALABLE COMPUTING & COMMUNICATIONS, CLOUD & BIG DATA COMPUTING, INTERNET OF PEOPLE AND SMART CITY INNOVATION (SMARTWORLD/SCALCOM/UIC/ATC/CBDCOM/IOP/SCI), 2018, : 331 - 337
  • [46] Multi-view representation learning for multi-view action recognition
    Hao, Tong
    Wu, Dan
    Wang, Qian
    Sun, Jin-Sheng
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2017, 48 : 453 - 460
  • [47] Fusion Attention for Action Recognition: Integrating Sparse-Dense and Global Attention for Video Action Recognition
    Kim, Hyun-Woo
    Choi, Yong-Suk
    SENSORS, 2024, 24 (21)
  • [49] STAR: Efficient SpatioTemporal Modeling for Action Recognition
    Kumar, Abhijeet
    Abrams, Samuel
    Kumar, Abhishek
    Narayanan, Vijaykrishnan
    CIRCUITS, SYSTEMS, AND SIGNAL PROCESSING, 2023, 42 : 705 - 723
  • [50] Learning 3D Skeletal Representation From Transformer for Action Recognition
    Cha, Junuk
    Saqlain, Muhammad
    Kim, Donguk
    Lee, Seungeun
    Lee, Seongyeong
    Baek, Seungryul
    IEEE ACCESS, 2022, 10 : 67541 - 67550