Multi-head attention-based two-stream EfficientNet for action recognition

Cited by: 22

Authors:

Zhou, Aihua [1,2]

Ma, Yujun [3]

Ji, Wanting [4]

Zong, Ming [5]

Yang, Pei [1,2]

Wu, Min [6]

Liu, Mingzhe [7]

Affiliations:

[1] State Grid Smart Grid Res Inst CO LTD, Beijing, Peoples R China

[2] State Grid Key Lab Informat & Network Secur, Nanjing, Peoples R China

[3] Massey Univ, Sch Math & Computat Sci, Auckland, New Zealand

[4] Liaoning Univ, Sch Informat, Shenyang, Peoples R China

[5] Peking Univ, Natl Engn Res Ctr Software Engn, Beijing, Peoples R China

[6] Beijing Inst Comp Technol & Applicat, Beijing, Peoples R China

[7] Chengdu Univ Technol, State Key Lab Geohazard Prevent & Geoenvironm Pro, Chengdu, Peoples R China

Source:

MULTIMEDIA SYSTEMS | 2023 / Vol. 29 / Issue 02

Keywords:

Action recognition; Multi-head attention; Two-stream network; SPATIAL-TEMPORAL ATTENTION; U-NET; NETWORK; SEGMENTATION; KNOWLEDGE;

DOI:

10.1007/s00530-022-00961-3

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Two-stream convolutional neural networks have become popular for action recognition in recent years. However, existing two-stream approaches struggle to distinguish visually similar actions in videos, such as sneezing and yawning. To solve this problem, we propose a Multi-head Attention-based Two-stream EfficientNet (MAT-EffNet) for action recognition, which takes advantage of the efficient feature extraction of EfficientNet. The proposed network consists of two streams (a spatial stream and a temporal stream), which first extract spatial and temporal features from consecutive frames using EfficientNet. A multi-head attention mechanism is then applied to the two streams to capture the key action information from the extracted features. The final prediction is obtained via late average fusion, which averages the softmax scores of the spatial and temporal streams. The proposed MAT-EffNet can focus on the key action information at different frames and compute the attention multiple times, in parallel, to distinguish similar actions. We test the proposed network on the UCF101, HMDB51 and Kinetics-400 datasets. Experimental results show that MAT-EffNet outperforms other state-of-the-art approaches for action recognition.
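
The late average fusion step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`softmax`, `late_average_fusion`, `predict_class`) and the three-class example logits are hypothetical, and the sketch assumes each stream has already produced one vector of per-class scores for a video.

```python
import math


def softmax(logits):
    """Convert raw per-class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_average_fusion(spatial_logits, temporal_logits):
    """Average the softmax scores of the two streams, class by class.

    Assumption: both streams score the same classes in the same order.
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same number of classes")
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    return [(s + t) / 2.0 for s, t in zip(spatial, temporal)]


def predict_class(spatial_logits, temporal_logits):
    """Return the index of the class with the highest fused score."""
    fused = late_average_fusion(spatial_logits, temporal_logits)
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    # Hypothetical logits for three classes. The spatial stream is nearly
    # undecided between classes 0 and 1, while the temporal stream clearly
    # favors class 1, so the fused prediction is class 1.
    spatial_logits = [2.0, 1.9, 0.1]
    temporal_logits = [0.5, 2.5, 0.2]
    print(predict_class(spatial_logits, temporal_logits))
```

Because each stream's softmax output sums to 1, the averaged scores also sum to 1 and can be read as a fused class distribution.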


Pages: 487-498

Page count: 12
