Multi-head attention-based two-stream EfficientNet for action recognition

Cited: 21
Authors
Zhou, Aihua [1 ,2 ]
Ma, Yujun [3 ]
Ji, Wanting [4 ]
Zong, Ming [5 ]
Yang, Pei [1 ,2 ]
Wu, Min [6 ]
Liu, Mingzhe [7 ]
Affiliations
[1] State Grid Smart Grid Res Inst CO LTD, Beijing, Peoples R China
[2] State Grid Key Lab Informat & Network Secur, Nanjing, Peoples R China
[3] Massey Univ, Sch Math & Computat Sci, Auckland, New Zealand
[4] Liaoning Univ, Sch Informat, Shenyang, Peoples R China
[5] Peking Univ, Natl Engn Res Ctr Software Engn, Beijing, Peoples R China
[6] Beijing Inst Comp Technol & Applicat, Beijing, Peoples R China
[7] Chengdu Univ Technol, State Key Lab Geohazard Prevent & Geoenvironm Pro, Chengdu, Peoples R China
Keywords
Action recognition; Multi-head attention; Two-stream network; Spatial-temporal attention; U-Net; Network; Segmentation; Knowledge
DOI
10.1007/s00530-022-00961-3
Chinese Library Classification
TP [Automation & Computer Technology]
Discipline code
0812
Abstract
Recent years have witnessed the popularity of two-stream convolutional neural networks for action recognition. However, existing two-stream convolutional neural network-based approaches are incapable of distinguishing roughly similar actions in videos, such as sneezing and yawning. To solve this problem, we propose a Multi-head Attention-based Two-stream EfficientNet (MAT-EffNet) for action recognition, which takes advantage of the efficient feature extraction of EfficientNet. The proposed network consists of two streams (i.e., a spatial stream and a temporal stream), which first extract spatial and temporal features from consecutive frames using EfficientNet. A multi-head attention mechanism is then applied to both streams to capture the key action information from the extracted features. The final prediction is obtained via late average fusion, which averages the softmax scores of the spatial and temporal streams. The proposed MAT-EffNet can focus on the key action information at different frames and compute the attention multiple times, in parallel, to distinguish similar actions. We evaluate the proposed network on the UCF101, HMDB51 and Kinetics-400 datasets. Experimental results show that MAT-EffNet outperforms other state-of-the-art approaches for action recognition.
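The abstract's pipeline (per-stream multi-head attention over frame features, followed by late average fusion of softmax scores) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the identity Q/K/V projections, mean pooling over frames, and linear classifier weights are hypothetical placeholders standing in for learned layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads):
    # x: (frames, dim) per-frame features from one stream's backbone.
    # Each head attends over the frame axis independently (this is what lets
    # the model weight key frames "multiple times, in parallel").
    frames, dim = x.shape
    head_dim = dim // num_heads
    # Hypothetical identity projections for brevity; real Q, K, V are learned.
    q = k = v = x.reshape(frames, num_heads, head_dim).transpose(1, 0, 2)
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim), axis=-1)
    out = (scores @ v).transpose(1, 0, 2).reshape(frames, dim)
    return out

def mat_effnet_predict(spatial_feats, temporal_feats, w_s, w_t, num_heads=4):
    # Attend over frames within each stream, pool, then classify per stream.
    s = multi_head_attention(spatial_feats, num_heads).mean(axis=0)
    t = multi_head_attention(temporal_feats, num_heads).mean(axis=0)
    # Late average fusion: average the per-stream softmax class scores.
    return 0.5 * (softmax(s @ w_s) + softmax(t @ w_t))
```

Because each fused output is an average of two probability distributions, the result still sums to one over the class axis.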
Pages: 487-498
Page count: 12
Related papers
50 records
  • [31] Multi-head attention-based model for reconstructing continuous missing time series data
    Wu, Huafeng
    Zhang, Yuxuan
    Liang, Linian
    Mei, Xiaojun
    Han, Dezhi
    Han, Bing
    Weng, Tien-Hsiung
    Li, Kuan-Ching
    JOURNAL OF SUPERCOMPUTING, 2023, 79 (18): : 20684 - 20711
  • [32] Multi-head Attention-Based Masked Sequence Model for Mapping Functional Brain Networks
    He, Mengshen
    Hou, Xiangyu
    Wang, Zhenwei
    Kang, Zili
    Zhang, Xin
    Qiang, Ning
    Ge, Bao
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT I, 2022, 13431 : 295 - 304
  • [33] Improved two-stream model for human action recognition
    Zhao, Yuxuan
    Man, Ka Lok
    Smith, Jeremy
    Siddique, Kamran
    Guan, Sheng-Uei
    EURASIP JOURNAL ON IMAGE AND VIDEO PROCESSING, 2020, 2020 (01)
  • [34] Hidden Two-Stream Convolutional Networks for Action Recognition
    Zhu, Yi
    Lan, Zhenzhong
    Newsam, Shawn
    Hauptmann, Alexander
    COMPUTER VISION - ACCV 2018, PT III, 2019, 11363 : 363 - 378
  • [36] Enhancing Recommendation Capabilities Using Multi-Head Attention-Based Federated Knowledge Distillation
    Wu, Aming
    Kwon, Young-Woo
    IEEE ACCESS, 2023, 11 : 45850 - 45861
  • [37] Combining Multi-Head Attention and Sparse Multi-Head Attention Networks for Session-Based Recommendation
    Zhao, Zhiwei
    Wang, Xiaoye
    Xiao, Yingyuan
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [38] A heterogeneous two-stream network for human action recognition
    Liao, Shengbin
    Wang, Xiaofeng
    Yang, ZongKai
    AI COMMUNICATIONS, 2023, 36 (03) : 219 - 233
  • [39] Two-Stream Dictionary Learning Architecture for Action Recognition
    Xu, Ke
    Jiang, Xinghao
    Sun, Tanfeng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2017, 27 (03) : 567 - 576
  • [40] Two-Stream Gated Fusion ConvNets for Action Recognition
    Zhu, Jiagang
    Zou, Wei
    Zhu, Zheng
    2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018, : 597 - 602