Multi-head attention-based two-stream EfficientNet for action recognition

Cited by: 21
Authors
Zhou, Aihua [1 ,2 ]
Ma, Yujun [3 ]
Ji, Wanting [4 ]
Zong, Ming [5 ]
Yang, Pei [1 ,2 ]
Wu, Min [6 ]
Liu, Mingzhe [7 ]
Affiliations
[1] State Grid Smart Grid Res Inst CO LTD, Beijing, Peoples R China
[2] State Grid Key Lab Informat & Network Secur, Nanjing, Peoples R China
[3] Massey Univ, Sch Math & Computat Sci, Auckland, New Zealand
[4] Liaoning Univ, Sch Informat, Shenyang, Peoples R China
[5] Peking Univ, Natl Engn Res Ctr Software Engn, Beijing, Peoples R China
[6] Beijing Inst Comp Technol & Applicat, Beijing, Peoples R China
[7] Chengdu Univ Technol, State Key Lab Geohazard Prevent & Geoenvironm Pro, Chengdu, Peoples R China
Keywords
Action recognition; Multi-head attention; Two-stream network; SPATIAL-TEMPORAL ATTENTION; U-NET; NETWORK; SEGMENTATION; KNOWLEDGE;
DOI
10.1007/s00530-022-00961-3
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Two-stream convolutional neural networks have become popular for action recognition in recent years. However, existing two-stream approaches are incapable of distinguishing some similar actions in videos, such as sneezing and yawning. To solve this problem, we propose a Multi-head Attention-based Two-stream EfficientNet (MAT-EffNet) for action recognition, which takes advantage of the efficient feature extraction of EfficientNet. The proposed network consists of two streams (i.e., a spatial stream and a temporal stream), which first extract spatial and temporal features from consecutive frames using EfficientNet. A multi-head attention mechanism is then applied to the two streams to capture the key action information in the extracted features. The final prediction is obtained via late average fusion, which averages the softmax scores of the spatial and temporal streams. The proposed MAT-EffNet can focus on the key action information at different frames and compute the attention multiple times, in parallel, to distinguish similar actions. We test the proposed network on the UCF101, HMDB51, and Kinetics-400 datasets. Experimental results show that MAT-EffNet outperforms other state-of-the-art approaches for action recognition.
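
The two steps the abstract names, multi-head attention over per-frame features and late average fusion of softmax scores, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the attention below uses no learned projection weights (each head simply attends over a slice of the feature vector), and the function names, feature sizes, and logit values are hypothetical choices made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention across frames.

    Assumption: queries, keys, and values are the raw frame vectors;
    a real model would apply learned linear projections first.
    """
    d = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, frames)) for i in range(d)])
    return out


def multi_head_attention(frames, num_heads):
    """Split each frame vector into num_heads slices, attend per slice, concatenate."""
    d = len(frames[0])
    if d % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    h = d // num_heads
    heads = [
        self_attention([f[i * h:(i + 1) * h] for f in frames])
        for i in range(num_heads)
    ]
    return [[x for head in heads for x in head[t]] for t in range(len(frames))]


def late_average_fusion(spatial_logits, temporal_logits):
    """Average the softmax scores of the two streams, class by class."""
    s = softmax(spatial_logits)
    t = softmax(temporal_logits)
    return [(a + b) / 2 for a, b in zip(s, t)]


# Hypothetical inputs: 3 frames with 4-dimensional features, 3 action classes.
frame_features = [[1.0, 0.0, 0.5, 0.2], [0.9, 0.1, 0.4, 0.3], [0.0, 1.0, 0.2, 0.8]]
attended = multi_head_attention(frame_features, num_heads=2)

spatial_logits = [2.0, 1.0, 0.1]
temporal_logits = [0.5, 2.5, 0.3]
fused = late_average_fusion(spatial_logits, temporal_logits)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because each stream's softmax scores sum to one, the fused scores do as well, so the fused vector can still be read as a distribution over action classes.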
Pages: 487 - 498
Page count: 12
Related papers
50 in total
  • [21] Two-Stream Adaptive Weight Convolutional Neural Network Based on Spatial Attention for Human Action Recognition
    Chen, Guanzhou
    Yao, Lu
    Xu, Jingting
    Liu, Qianxi
    Chen, Shengyong
    INTELLIGENT ROBOTICS AND APPLICATIONS (ICIRA 2022), PT IV, 2022, 13458 : 319 - 330
  • [22] Temporal Shift and Spatial Attention-Based Two-Stream Network for Traffic Risk Assessment
    Liu, Chunsheng
    Li, Zijian
    Chang, Faliang
    Li, Shuang
    Xie, Jincan
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2022, 23 (08) : 12518 - 12530
  • [23] Skeleton-based Action Recognition Method with Two-Stream Multi-relational GCNs
    Liu F.
    Qiao J.-Z.
    Dai Q.
    Shi X.-B.
    Dongbei Daxue Xuebao/Journal of Northeastern University, 2021, 42 (06) : 768 - 774
  • [24] Human Action Recognition Based on Improved Two-Stream Convolution Network
    Wang, Zhongwen
    Lu, Haozhu
    Jin, Junlan
    Hu, Kai
    APPLIED SCIENCES-BASEL, 2022, 12 (12):
  • [25] Speech recognition based on the transformer's multi-head attention in Arabic
    Mahmoudi O.
    Filali-Bouami M.
    Benchat M.
    International Journal of Speech Technology, 2024, 27 (01) : 211 - 223
  • [26] Human Action Recognition Based on a Two-stream Convolutional Network Classifier
    Silva, Vincius de Oliveira
    Vidal, Flavio de Barros
    Soares Romariz, Alexandre Ricardo
    2017 16TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2017 : 774 - 778
  • [27] Multi-Head Attention-Based Hybrid Deep Neural Network for Aeroengine Risk Assessment
    Li, Jian-Hang
    Gao, Xin-Yue
    Lu, Xiang
    Liu, Guo-Dong
    IEEE ACCESS, 2023, 11 : 113376 - 113389
  • [28] Internal defects inspection of arc magnets using multi-head attention-based CNN
    Li, Qiang
    Huang, Qinyuan
    Yang, Tian
    Zhou, Ying
    Yang, Kun
    Song, Hong
    MEASUREMENT, 2022, 202
  • [29] Self Multi-Head Attention-based Convolutional Neural Networks for fake news detection
    Fang, Yong
    Gao, Jian
    Huang, Cheng
    Peng, Hua
    Wu, Runpu
    PLOS ONE, 2019, 14 (09):
  • [30] Two-stream Deep Representation for Human Action Recognition
    Ghrab, Najla Bouarada
    Fendri, Emna
    Hammami, Mohamed
    FOURTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2021), 2022, 12084