Pyramid Spatial-Temporal Graph Transformer for Skeleton-Based Action Recognition

被引:10
作者
Chen, Shuo [1 ]
Xu, Ke [1 ]
Jiang, Xinghao [1 ]
Sun, Tanfeng [1 ]
机构
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai 200240, Peoples R China
来源
APPLIED SCIENCES-BASEL | 2022年 / 12卷 / 18期
关键词
skeleton-based action recognition; graph embedding; transformer; spatial-temporal attention; pyramid; NETWORKS;
D O I
10.3390/app12189229
中图分类号
O6 [化学];
学科分类号
0703 ;
摘要
Although graph convolutional networks (GCNs) have shown their demonstrated ability in skeleton-based action recognition, both the spatial and the temporal connections rely too much on the predefined skeleton graph, which imposes a fixed prior knowledge for the aggregation of high-level semantic information via the graph-based convolution. Some previous GCN-based works introduced dynamic topology (vertex connection relationships) to capture flexible spatial correlations from different actions. Then, the local relationships from both the spatial and temporal domains can be captured by diverse GCNs. This paper introduces a more straightforward and more effective backbone to obtain the spatial-temporal correlation between skeleton joints with a local-global alternation pyramid architecture for skeleton-based action recognition, namely the pyramid spatial-temporal graph transformer (PGT). The PGT consists of four stages with similar architecture but different scales: graph embedding and transformer blocks. We introduce two kinds of transformer blocks in our work: the spatial-temporal transformer block and joint transformer block. In the former, spatial-temporal separated attention (STSA) is proposed to calculate the connection of the global nodes of the graph. Due to the spatial-temporal transformer block, self-attention can be performed on skeleton graphs with long-range temporal and large-scale spatial aggregation. The joint transformer block flattens the tokens in both the spatial and temporal domains to jointly capture the overall spatial-temporal correlations. The PGT is evaluated on three public skeleton datasets: the NTU RGBD 60, NTU RGBD 120 and NW-UCLA datasets. Better or comparable performance with the state of the art (SOTA) shows the effectiveness of our work.
引用
收藏
页数:16
相关论文
共 50 条
[1]   Fuzzy Integral-Based CNN Classifier Fusion for 3D Skeleton Action Recognition [J].
Banerjee, Avinandan ;
Singh, Pawan Kumar ;
Sarkar, Ram .
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (06) :2206-2216
[2]   SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition [J].
Caetano, Carlos ;
Sena, Jessica ;
Bremond, Francois ;
dos Santos, Jefersson A. ;
Schwartz, William Robson .
2019 16TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), 2019,
[3]   Skeleton Image Representation for 3D Action Recognition based on Tree Structure and Reference Joints [J].
Caetano, Carlos ;
Bremond, Francois ;
Schwartz, William Robson .
2019 32ND SIBGRAPI CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 2019, :16-23
[4]   End-to-End Object Detection with Transformers [J].
Carion, Nicolas ;
Massa, Francisco ;
Synnaeve, Gabriel ;
Usunier, Nicolas ;
Kirillov, Alexander ;
Zagoruyko, Sergey .
COMPUTER VISION - ECCV 2020, PT I, 2020, 12346 :213-229
[5]   Dual-domain graph convolutional networks for skeleton-based action recognition [J].
Chen, Shuo ;
Xu, Ke ;
Mi, Zhongjie ;
Jiang, Xinghao ;
Sun, Tanfeng .
MACHINE LEARNING, 2022, 111 (07) :2381-2406
[6]   Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition [J].
Chen, Yuxin ;
Zhang, Ziqi ;
Yuan, Chunfeng ;
Li, Bing ;
Deng, Ying ;
Hu, Weiming .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :13339-13348
[7]   Decoupling GCN with DropGraph Module for Skeleton-Based Action Recognition [J].
Cheng, Ke ;
Zhang, Yifan ;
Cao, Congqi ;
Shi, Lei ;
Cheng, Jian ;
Lu, Hanqing .
COMPUTER VISION - ECCV 2020, PT XXIV, 2020, 12369 :536-553
[8]   Skeleton-Based Action Recognition with Shift Graph Convolutional Network [J].
Cheng, Ke ;
Zhang, Yifan ;
He, Xiangyu ;
Chen, Weihan ;
Cheng, Jian ;
Lu, Hanqing .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, :180-189
[9]  
Cho S, 2020, IEEE WINT CONF APPL, P624, DOI [10.1109/wacv45572.2020.9093639, 10.1109/WACV45572.2020.9093639]
[10]   VPN: Learning Video-Pose Embedding for Activities of Daily Living [J].
Das, Srijan ;
Sharma, Saurav ;
Dai, Rui ;
Bremond, Francois ;
Thonnat, Monique .
COMPUTER VISION - ECCV 2020, PT IX, 2020, 12354 :72-90