MGSAN: multimodal graph self-attention network for skeleton-based action recognition

Cited by: 0
Authors
Junyi Wang [1 ]
Ziao Li [2 ]
Bangli Liu [3 ]
Haibin Cai [1 ]
Mohamad Saada [4 ]
Qinggang Meng [5 ]
Affiliations
[1] Northeastern University, Faculty of Robot Science and Engineering
[2] The Center of National Railway Intelligent Transportation System Engineering and Technology, Foshan Graduate School of Innovation
[3] Northeastern University, School of Intelligent Systems Engineering
[4] Sun Yat-sen University, Computer Science and Informatics
[5] De Montfort University, Department of Computer Science
[6] Loughborough University
Keywords
Skeleton-based action recognition; Graph convolutional network; Self-attention network
DOI
10.1007/s00530-024-01566-8
Abstract
With the emergence of graph convolutional networks (GCNs), skeleton-based action recognition has achieved remarkable results. However, current models for skeleton-based action analysis treat a skeleton sequence as a series of graphs and aggregate features of the entire sequence by alternately extracting spatial and temporal features, i.e., a 2D (spatial) plus 1D (temporal) approach to feature extraction. This overlooks the complex spatiotemporal relationships between joints during motion, making it difficult for such models to capture connections across different frames and joints. In this paper, we propose a Multimodal Graph Self-Attention Network (MGSAN), which combines GCNs with self-attention to model the spatiotemporal relationships within skeleton sequences. First, we design graph self-attention (GSA) blocks to capture the intrinsic topology and long-term temporal dependencies between joints. Second, we propose a multi-scale spatio-temporal convolutional network for channel-wise topology modeling (CW-TCN) to model the short-term, smooth temporal dynamics of joint movements. Finally, we propose a multimodal fusion strategy that fuses the joint, joint-motion, and bone streams, providing the model with richer multimodal features for better predictions. The proposed MGSAN achieves state-of-the-art performance on three large-scale skeleton-based action recognition datasets, with accuracies of 93.1% on the NTU RGB+D 60 cross-subject benchmark, 90.3% on the NTU RGB+D 120 cross-subject benchmark, and 97.0% on the NW-UCLA dataset. Code is available at https://github.com/lizaowo/MGSAN.
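To make the two ideas named in the abstract concrete, the sketch below shows (1) a toy graph self-attention block that adds a learnable skeleton-topology bias to scaled dot-product attention over joints, and (2) weighted late fusion of class scores from joint, joint-motion, and bone streams. This is a minimal illustration under assumed shapes and names, not the authors' implementation; see the GitHub repository above for the official code.

```python
# Hypothetical sketch only; module names, shapes, and fusion weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphSelfAttention(nn.Module):
    """Toy graph self-attention over joints.

    x: (N, V, C) with N samples, V joints, C channels.
    A learnable adjacency bias is added to the attention logits so that the
    skeleton topology and long-range joint dependencies are modeled together.
    """

    def __init__(self, channels: int, num_joints: int):
        super().__init__()
        self.qkv = nn.Linear(channels, channels * 3)
        self.proj = nn.Linear(channels, channels)
        # Learnable topology bias over joint pairs (zero-initialized; an assumption).
        self.adj_bias = nn.Parameter(torch.zeros(num_joints, num_joints))
        self.scale = channels ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)            # each (N, V, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale      # (N, V, V)
        attn = F.softmax(attn + self.adj_bias, dim=-1)     # inject learned topology
        return x + self.proj(attn @ v)                     # residual connection


def fuse_streams(joint_logits, motion_logits, bone_logits, weights=(0.6, 0.6, 0.4)):
    """Weighted late fusion of per-stream class scores (weights are illustrative)."""
    return (weights[0] * joint_logits
            + weights[1] * motion_logits
            + weights[2] * bone_logits)


if __name__ == "__main__":
    x = torch.randn(8, 25, 64)                  # 8 samples, 25 joints, 64 channels
    gsa = GraphSelfAttention(channels=64, num_joints=25)
    print(gsa(x).shape)                         # torch.Size([8, 25, 64])
```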