A hybrid attention-guided ConvNeXt-GRU network for action recognition

Cited by: 9
Authors
An, Yiyuan [1 ]
Yi, Yingmin [1 ]
Han, Xiaoyong [2 ]
Wu, Li [1 ]
Su, Chunyi [3 ]
Liu, Bojun [1 ]
Xue, Xianghong [1 ]
Li, Yankai [1 ]
Affiliations
[1] Xian Univ Technol, Sch Automat & Informat Engn, Xian 710048, Peoples R China
[2] Tsinghua Univ, Xingjian Coll, Beijing 100084, Peoples R China
[3] Concordia Univ, Montreal, PQ H3B 1R6, Canada
Keywords
Action recognition; Selective kernel network; Efficient channel attention; ConvNeXt; Gated recurrent unit; LSTM
DOI
10.1016/j.engappai.2024.108243
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
In the digital age, with the continuous emergence of large-scale video data, video understanding has become increasingly important. As a core task within it, action recognition has attracted widespread attention. However, video is high-dimensional and contains human action information at multiple scales, which makes it difficult for conventional attention mechanisms to capture complex action information. To improve action recognition performance, a Hybrid Attention-guided ConvNeXt-GRU Network (HACG) is proposed. Specifically, a Novel Attention Mechanism (ANM) is constructed by integrating a parameter-free attention module into ConvNeXt, enabling preliminary extraction of important features without adding extra parameters. Then, a Multiscale Hybrid Attention Module (MHAM) adopts an improved, efficient Selective Kernel Network (SKNet) to adaptively calibrate channel features; in this way, the module enhances the model's ability to perceive features at different scales while strengthening inter-channel correlation. Furthermore, MHAM incorporates Atrous Spatial Pyramid Pooling (ASPP) to extract local and global information from different regions. Finally, MHAM is combined with a Gated Recurrent Unit (GRU) to capture the interdependence between space and time. Experimental results show that HACG is highly competitive with the state of the art on the UCF-101, HMDB-51, and Kinetics-400 datasets. This indicates that HACG captures important features more effectively, suppresses noise interference, and carries a lower computational load, making it a promising choice for action recognition tasks.
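The temporal component named in the abstract is a standard GRU. As a purely illustrative sketch, the textbook GRU gate equations for one recurrence step can be written out as follows; the scalar toy parameters are hypothetical and not taken from the paper, and scalars are used only to keep the recurrence readable.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h_prev, W, U, b):
    """One step of a standard GRU cell (scalar input/hidden for clarity).

    z_t = sigmoid(W_z*x + U_z*h_prev + b_z)      # update gate
    r_t = sigmoid(W_r*x + U_r*h_prev + b_r)      # reset gate
    n_t = tanh(W_n*x + U_n*(r_t*h_prev) + b_n)   # candidate state
    h_t = (1 - z_t)*n_t + z_t*h_prev             # new hidden state
    """
    z = sigmoid(W["z"] * x + U["z"] * h_prev + b["z"])
    r = sigmoid(W["r"] * x + U["r"] * h_prev + b["r"])
    n = math.tanh(W["n"] * x + U["n"] * (r * h_prev) + b["n"])
    return (1.0 - z) * n + z * h_prev

# Hypothetical toy parameters (all weights 0.5, biases 0.0) -- not from the paper.
W = {"z": 0.5, "r": 0.5, "n": 0.5}
U = {"z": 0.5, "r": 0.5, "n": 0.5}
b = {"z": 0.0, "r": 0.0, "n": 0.0}

h = 0.0
for x in [0.2, 0.8, -0.4]:  # a toy 1-D "feature sequence", one value per frame
    h = gru_cell(x, h, W, U, b)
```

In HACG, the GRU would consume the multi-channel feature vectors produced by MHAM at each frame rather than scalars; since h_t is a convex combination of h_prev and a tanh-bounded candidate, the hidden state stays in (-1, 1) throughout the sequence.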
Pages: 13
References (69 records in total)
[1]   Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos [J].
Agethen, Sebastian ;
Hsu, Winston H. .
IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (03) :819-829
[2]   Crack Segmentation Network using Additive Attention Gate-CSN-II [J].
Ali, Raza ;
Chuah, Joon Huang ;
Abu Talip, Mohamad Sofian ;
Mokhtar, Norrima ;
Shoaib, Muhammad Ali .
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2022, 114
[3]   Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [J].
Carreira, Joao ;
Zisserman, Andrew .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :4724-4733
[4]   Feature fusion and kernel selective in Inception-v4 network [J].
Chen, Feng ;
Wei, Jiangshu ;
Xue, Bing ;
Zhang, Mengjie .
APPLIED SOFT COMPUTING, 2022, 119
[5]   AGPN: Action Granularity Pyramid Network for Video Action Recognition [J].
Chen, Yatong ;
Ge, Hongwei ;
Liu, Yuxuan ;
Cai, Xinye ;
Sun, Liang .
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) :3912-3923
[6]   Using a novel clustered 3D-CNN model for improving crop future price prediction [J].
Cheung, Liege ;
Wang, Yun ;
Lau, Adela S. M. ;
Chan, Rogers M. C. .
KNOWLEDGE-BASED SYSTEMS, 2023, 260
[7]   ResLT: Residual Learning for Long-Tailed Recognition [J].
Cui, Jiequan ;
Liu, Shu ;
Tian, Zhuotao ;
Zhong, Zhisheng ;
Jia, Jiaya .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (03) :3695-3706
[8]   SlowFast Networks for Video Recognition [J].
Feichtenhofer, Christoph ;
Fan, Haoqi ;
Malik, Jitendra ;
He, Kaiming .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :6201-6210
[9]   DanHAR: Dual Attention Network for multimodal human activity recognition using wearable sensors [J].
Gao, Wenbin ;
Zhang, Lei ;
Teng, Qi ;
He, Jun ;
Wu, Hao .
APPLIED SOFT COMPUTING, 2021, 111
[10]   End-to-End Blind Video Quality Assessment Based on Visual and Memory Attention Modeling [J].
Guan, Xiaodi ;
Li, Fan ;
Zhang, Yangfan ;
Cosman, Pamela C. .
IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 :5206-5221