Multi-head attention-based two-stream EfficientNet for action recognition

Cited by: 21
Authors
Zhou, Aihua [1, 2]
Ma, Yujun [3 ]
Ji, Wanting [4 ]
Zong, Ming [5 ]
Yang, Pei [1, 2]
Wu, Min [6 ]
Liu, Mingzhe [7 ]
Affiliations
[1] State Grid Smart Grid Res Inst CO LTD, Beijing, Peoples R China
[2] State Grid Key Lab Informat & Network Secur, Nanjing, Peoples R China
[3] Massey Univ, Sch Math & Computat Sci, Auckland, New Zealand
[4] Liaoning Univ, Sch Informat, Shenyang, Peoples R China
[5] Peking Univ, Natl Engn Res Ctr Software Engn, Beijing, Peoples R China
[6] Beijing Inst Comp Technol & Applicat, Beijing, Peoples R China
[7] Chengdu Univ Technol, State Key Lab Geohazard Prevent & Geoenvironm Pro, Chengdu, Peoples R China
Keywords
Action recognition; Multi-head attention; Two-stream network; SPATIAL-TEMPORAL ATTENTION; U-NET; NETWORK; SEGMENTATION; KNOWLEDGE
DOI
10.1007/s00530-022-00961-3
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Recent years have witnessed the popularity of two-stream convolutional neural networks for action recognition. However, existing two-stream approaches are incapable of distinguishing some roughly similar actions in videos, such as sneezing and yawning. To solve this problem, we propose a Multi-head Attention-based Two-stream EfficientNet (MAT-EffNet) for action recognition, which takes advantage of the efficient feature extraction of EfficientNet. The proposed network consists of two streams (i.e., a spatial stream and a temporal stream), which first extract spatial and temporal features from consecutive frames using EfficientNet. Then, a multi-head attention mechanism is applied to the two streams to capture the key action information from the extracted features. The final prediction is obtained via late average fusion, which averages the softmax scores of the spatial and temporal streams. The proposed MAT-EffNet can focus on the key action information at different frames and compute the attention multiple times, in parallel, to distinguish similar actions. We test the proposed network on the UCF101, HMDB51 and Kinetics-400 datasets. Experimental results show that MAT-EffNet outperforms other state-of-the-art approaches for action recognition.
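The two steps the abstract names after feature extraction, multi-head attention over per-frame features and late average fusion of the two streams' softmax scores, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names are hypothetical, the attention here uses the raw features as queries, keys and values (a trained network would normally apply learned projections), and the EfficientNet backbones are left out entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(frames, num_heads):
    """Parameter-free multi-head self-attention over per-frame features.

    frames: list of T feature vectors, each of length D.
    Assumption: queries, keys and values are the raw features; no learned
    projection matrices are applied, unlike in a trained model.
    """
    dim = len(frames[0])
    if dim % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    head_dim = dim // num_heads
    outputs = [[] for _ in frames]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        # Each head sees its own slice of every frame's feature vector.
        heads = [f[lo:hi] for f in frames]
        for t, query in enumerate(heads):
            # Scaled dot-product scores of frame t against every frame.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(head_dim)
                for key in heads
            ]
            weights = softmax(scores)
            # Weighted sum of the value slices, one output per feature index.
            attended = [
                sum(w * v[i] for w, v in zip(weights, heads))
                for i in range(head_dim)
            ]
            # Concatenate the heads back into a length-D vector.
            outputs[t].extend(attended)
    return outputs


def late_average_fusion(spatial_logits, temporal_logits):
    """Average the softmax scores of the spatial and temporal streams."""
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    return [(s + t) / 2 for s, t in zip(spatial, temporal)]


if __name__ == "__main__":
    # Toy per-frame features (T=2, D=4) and toy class logits (3 classes).
    attended = multi_head_self_attention(
        [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]], num_heads=2
    )
    fused = late_average_fusion([2.0, 1.0, 0.1], [0.5, 2.5, 0.3])
    print(attended)
    print(fused, "predicted class:", fused.index(max(fused)))
```

Because each stream is normalized by softmax before averaging, neither stream can dominate the fused prediction merely by producing larger raw logits.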
Pages: 487-498
Page count: 12
Related papers
50 in total
  • [1] Multi-head attention-based two-stream EfficientNet for action recognition
    Aihua Zhou
    Yujun Ma
    Wanting Ji
    Ming Zong
    Pei Yang
    Min Wu
    Mingzhe Liu
    Multimedia Systems, 2023, 29 : 487 - 498
  • [2] Cascade multi-head attention networks for action recognition
    Wang, Jiaze
    Peng, Xiaojiang
    Qiao, Yu
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2020, 192
  • [3] Multi-Head Attention-Based Spectrum Sensing for Radio
    Devarakonda, B. V. Ravisankar
    Nandanavam, Venkateswararao
    INTERNATIONAL JOURNAL OF ELECTRICAL AND COMPUTER ENGINEERING SYSTEMS, 2023, 14 (02) : 135 - 143
  • [4] Human action recognition using two-stream attention based LSTM networks
    Dai, Cheng
    Liu, Xingang
    Lai, Jinfeng
    APPLIED SOFT COMPUTING, 2020, 86
  • [5] Two-stream Graph Attention Convolutional for Video Action Recognition
    Zhang, Deyuan
    Gao, Hongwei
    Dai, Hailong
    Shi, Xiangbin
    2021 IEEE 15TH INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING (BIGDATASE 2021), 2021 : 23 - 27
  • [6] Multiscaled Multi-Head Attention-Based Video Transformer Network for Hand Gesture Recognition
    Garg, Mallika
    Ghosh, Debashis
    Pradhan, Pyari Mohan
    IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 80 - 84
  • [7] A novel two-stream multi-head self-attention convolutional neural network for bearing fault diagnosis
    Ren, Hang
    Liu, Shaogang
    Wei, Fengmei
    Qiu, Bo
    Zhao, Dan
    PROCEEDINGS OF THE INSTITUTION OF MECHANICAL ENGINEERS PART C-JOURNAL OF MECHANICAL ENGINEERING SCIENCE, 2024, 238 (11) : 5393 - 5405
  • [8] Improving CRNN with EfficientNet-like feature extractor and multi-head attention for text recognition
    Dinh Viet Sang
    Le Tran Bao Cuong
    SOICT 2019: PROCEEDINGS OF THE TENTH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY, 2019 : 285 - 290
  • [9] A multi-head adjacent attention-based pyramid layered model for nested named entity recognition
    Shengmin Cui
    Inwhee Joe
    Neural Computing and Applications, 2023, 35 : 2561 - 2574
  • [10] Two-Stream Adaptive Attention Graph Convolutional Networks for Action Recognition
    Du Q.
    Xiang Z.
    Tian L.
    Yu L.
    Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science), 2022, 50 (12) : 20 - 29