A hybrid attention-guided ConvNeXt-GRU network for action recognition

Cited by: 9
Authors
An, Yiyuan [1 ]
Yi, Yingmin [1 ]
Han, Xiaoyong [2 ]
Wu, Li [1 ]
Su, Chunyi [3 ]
Liu, Bojun [1 ]
Xue, Xianghong [1 ]
Li, Yankai [1 ]
Affiliations
[1] Xian Univ Technol, Sch Automat & Informat Engn, Xian 710048, Peoples R China
[2] Tsinghua Univ, Xingjian Coll, Beijing 100084, Peoples R China
[3] Concordia Univ, Montreal, PQ H3B 1R6, Canada
Keywords
Action recognition; Selective kernel network; Efficient channel attention; ConvNeXt; Gated recurrent unit; LSTM
DOI
10.1016/j.engappai.2024.108243
CLC classification number
TP [automation technology, computer technology]
Discipline classification code
0812
Abstract
In the digital age, with the continuous emergence of large-scale video data, video understanding has become increasingly important, and action recognition, as a core domain, has garnered widespread attention. However, video is high-dimensional and contains human action information at multiple scales, which makes it difficult for conventional attention mechanisms to capture complex action information. To improve action recognition performance, a Hybrid Attention-guided ConvNeXt-GRU Network (HACG) is proposed. Specifically, a Novel Attention Mechanism (ANM) is constructed by integrating a parameter-free attention module into ConvNeXt, enabling the preliminary extraction of important features without adding extra parameters. Then, a Multiscale Hybrid Attention Module (MHAM) adopts an improved, efficient Selective Kernel Network (SKNet) to adaptively calibrate channel features; in this way, the module enhances the model's ability to perceive features at different scales while strengthening the correlation between channels. Furthermore, MHAM incorporates Atrous Spatial Pyramid Pooling (ASPP) to extract local and global information from different regions. Finally, MHAM is integrated with a Gated Recurrent Unit (GRU) to capture the interdependence between space and time. Experimental results show that HACG is highly competitive with the state of the art on the UCF-101, HMDB-51, and Kinetics-400 datasets. This indicates that HACG captures important features more effectively, suppresses noise interference, and carries a lower computational load, making it a highly promising choice for action recognition tasks.
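The abstract does not spell out which parameter-free attention module is folded into ConvNeXt; a common choice of this kind is a SimAM-style energy-based gate, which weights each spatial position by its statistical distinctiveness within the channel without introducing any learnable parameters. The sketch below is an illustrative assumption in that style, not the paper's exact ANM (the function name `simam_weight` and the regularizer `lam` are ours):

```python
import numpy as np

def simam_weight(x, lam=1e-4):
    """SimAM-style parameter-free attention (illustrative sketch).

    x   : feature map of shape (C, H, W)
    lam : small regularizer for numerical stability

    For each channel, positions that deviate from the channel mean get a
    higher inverse-energy score; the score is passed through a sigmoid and
    used to gate the input, with no learnable parameters involved.
    """
    c, h, w = x.shape
    n = h * w - 1
    mu = x.mean(axis=(1, 2), keepdims=True)          # per-channel mean
    d = (x - mu) ** 2                                # squared deviation
    var = d.sum(axis=(1, 2), keepdims=True) / n      # per-channel variance
    e_inv = d / (4.0 * (var + lam)) + 0.5            # inverse energy score
    gate = 1.0 / (1.0 + np.exp(-e_inv))              # sigmoid gate in (0.5, 1)
    return x * gate
```

Because the gate is computed purely from channel statistics, the module adds attention at zero parameter cost, which matches the abstract's claim of extracting important features "without the addition of extra parameters".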
Pages: 13