Deformable graph convolutional transformer for skeleton-based action recognition

Cited by: 3
Authors
Chen, Shuo [1]
Xu, Ke [1]
Zhu, Bo [1]
Jiang, Xinghao [1]
Sun, Tanfeng [1]
Affiliations
[1] Shanghai Jiao Tong University, School of Electronic Information and Electrical Engineering, Shanghai 200240, People's Republic of China
Keywords
Action recognition; Skeleton; Graph convolution networks; Deformable; Transformer; Attention; Networks
DOI
10.1007/s10489-022-04302-9
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The critical problem in skeleton-based action recognition is extracting high-level semantics from the dynamic changes between skeleton joints. Graph Convolutional Networks (GCNs) are therefore widely applied to capture the spatial-temporal information of dynamic joint coordinates through graph-based convolution. However, previous GCNs with fixed graph convolution kernels are limited by static graph topology and cannot adapt to the geometric variations of actions. Moreover, the local information of adjacent nodes is aggregated layer by layer, which increases model complexity. In this work, a Deformable Graph Convolutional Transformer (DGT) for skeleton-based action recognition is proposed to extract adaptive features through a flexible, learnable receptive field. DGT adopts a multiple-input-branches (MIB) architecture to obtain several kinds of information, such as joints, bones, and motions; these features are fused in the transformer classifier. Spatial-Temporal Graph Convolution (STGC) units then learn a preliminary feature representation that captures both spatial and temporal dependencies on the graph. Next, a deformable spatial-temporal compound attention backbone learns a robust representation from adaptive deformable skeleton features. The adaptive representation is obtained by dynamically adjusting the receptive field with an offset-based convolution method. In addition, a self-attention-based transformer classifier (TC) encodes the sequence of features flattened along the spatial and temporal dimensions. Its fully connected attention mechanism further strengthens the high-level semantic representation by focusing on essential nodes in the graph. We evaluated DGT on two challenging large-scale datasets, NTU-RGBD 60 and NTU-RGBD 120. Experimental results show that DGT adaptively optimizes the attention paid to different joints. DGT achieves performance comparable to the state of the art while being much more efficient, which demonstrates the effectiveness of the proposed method.
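The abstract names joints, bones, and motions as the inputs to the multiple-input-branches architecture but does not say how they are computed. The sketch below shows one common convention in skeleton-based recognition, offered as an assumption and not as the authors' implementation: a bone is a joint's coordinates minus its parent's, and motion is the frame-to-frame difference. The function name `multi_branch_inputs`, the 5-joint toy skeleton, and the `parents` list are all hypothetical.

```python
import numpy as np


def multi_branch_inputs(joints, parents):
    """Derive joint, bone, and motion streams from a skeleton sequence.

    joints  : array of shape (T, V, C) = (frames, joints, coordinates)
    parents : length-V sequence; parents[v] is the parent joint of v,
              and the root joint lists itself as its own parent.
    Both conventions are assumptions made for this illustration.
    """
    joints = np.asarray(joints, dtype=float)
    parents = np.asarray(parents)

    # Bone stream: vector from each joint's parent to the joint.
    # The root maps to itself, so its bone vector is zero.
    bones = joints - joints[:, parents, :]

    # Motion stream: difference between consecutive frames,
    # zero-padded at the last frame to keep the shape (T, V, C).
    motion = np.zeros_like(joints)
    motion[:-1] = joints[1:] - joints[:-1]

    return joints, bones, motion


# Toy example: 4 frames, 5 joints, 3D coordinates (random values).
rng = np.random.default_rng(0)
joints = rng.normal(size=(4, 5, 3))
parents = [0, 0, 1, 1, 3]  # hypothetical skeleton tree, joint 0 is the root

j, b, m = multi_branch_inputs(joints, parents)
print(j.shape, b.shape, m.shape)
```

All three streams keep the same shape, so each one can feed a separate input branch before the features are fused downstream.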
Pages: 15390-15406
Page count: 17