Multi-head attention-based two-stream EfficientNet for action recognition

Cited by: 21
Authors
Zhou, Aihua [1 ,2 ]
Ma, Yujun [3 ]
Ji, Wanting [4 ]
Zong, Ming [5 ]
Yang, Pei [1 ,2 ]
Wu, Min [6 ]
Liu, Mingzhe [7 ]
Affiliations
[1] State Grid Smart Grid Res Inst CO LTD, Beijing, Peoples R China
[2] State Grid Key Lab Informat & Network Secur, Nanjing, Peoples R China
[3] Massey Univ, Sch Math & Computat Sci, Auckland, New Zealand
[4] Liaoning Univ, Sch Informat, Shenyang, Peoples R China
[5] Peking Univ, Natl Engn Res Ctr Software Engn, Beijing, Peoples R China
[6] Beijing Inst Comp Technol & Applicat, Beijing, Peoples R China
[7] Chengdu Univ Technol, State Key Lab Geohazard Prevent & Geoenvironm Pro, Chengdu, Peoples R China
Keywords
Action recognition; Multi-head attention; Two-stream network; Spatial-temporal attention; U-Net; Network; Segmentation; Knowledge
DOI
10.1007/s00530-022-00961-3
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Recent years have witnessed the popularity of two-stream convolutional neural networks for action recognition. However, existing two-stream approaches cannot distinguish some roughly similar actions in videos, such as sneezing and yawning. To solve this problem, we propose a Multi-head Attention-based Two-stream EfficientNet (MAT-EffNet) for action recognition, which takes advantage of the efficient feature extraction of EfficientNet. The proposed network consists of two streams (a spatial stream and a temporal stream), which first extract spatial and temporal features from consecutive frames using EfficientNet. A multi-head attention mechanism is then applied to both streams to capture the key action information from the extracted features. The final prediction is obtained via late average fusion, which averages the softmax scores of the spatial and temporal streams. By computing attention multiple times in parallel, MAT-EffNet can focus on the key action information in different frames and thereby distinguish similar actions. We evaluate the proposed network on the UCF101, HMDB51, and Kinetics-400 datasets. Experimental results show that MAT-EffNet outperforms other state-of-the-art approaches for action recognition.
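The attention and fusion steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the per-frame features are assumed to have already been extracted (no EfficientNet is run here), and each attention head uses the raw feature slice as query, key, and value instead of learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(frames):
    """Scaled dot-product self-attention over per-frame feature vectors.

    Assumption: features serve directly as queries, keys, and values
    (no learned projection matrices), purely for illustration.
    """
    dim = len(frames[0])
    attended = []
    for query in frames:
        # Similarity of this frame to every frame, scaled by sqrt(dim).
        weights = softmax([
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ])
        # Weighted sum of all frame features.
        attended.append([
            sum(w * value[i] for w, value in zip(weights, frames))
            for i in range(dim)
        ])
    return attended


def multi_head_attention(frames, num_heads):
    """Split features into num_heads slices, attend per slice, concatenate."""
    dim = len(frames[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        sliced = [f[h * head_dim:(h + 1) * head_dim] for f in frames]
        heads.append(attention_head(sliced))
    # Concatenate the head outputs frame by frame.
    return [
        [x for head in heads for x in head[t]]
        for t in range(len(frames))
    ]


def late_average_fusion(spatial_logits, temporal_logits):
    """Average the two streams' softmax scores; return (class index, scores)."""
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    fused = [(s + t) / 2 for s, t in zip(spatial, temporal)]
    return fused.index(max(fused)), fused
```

In this sketch, each stream would pass its frame features through `multi_head_attention` before classification, and `late_average_fusion` would combine the two streams' class logits into the final prediction.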
Pages: 487-498
Number of pages: 12