Decoupled spatio-temporal grouping transformer for skeleton-based action recognition

Cited by: 1
Authors
Sun, Shengkun [1 ]
Jia, Zihao [1 ]
Zhu, Yisheng [1 ]
Liu, Guangcan [2 ]
Yu, Zhengtao [3 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Automat, 219 NingLiu Rd, Nanjing 210000, Jiangsu, Peoples R China
[2] Southeast Univ, Sch Automat, 2 Southeast Univ Rd, Nanjing 210018, Jiangsu, Peoples R China
[3] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, 727 Jingming Rd South, Kunming 650500, Yunnan, Peoples R China
Keywords
Skeleton-based action recognition; Transformer; Decoupled; Network
DOI
10.1007/s00371-023-03132-1
CLC Classification
TP31 [Computer Software]
Subject Classification
081202; 0835
Abstract
Capturing correlations between joints is crucial in skeleton-based action recognition. The Transformer has demonstrated a strong capability for capturing such correlations. However, conventional Transformer-based approaches model the relationships between joints in a unified spatio-temporal dimension, disregarding the distinct semantic information carried by the spatial and temporal dimensions of skeleton sequences. To address this issue, we present a novel decoupled spatio-temporal grouping Transformer (DSTGFormer). The skeleton sequence is split into multiple spatio-temporal groups, each containing a set of consecutive frames. A spatio-temporal positional encoding (STPE) module assigns identity information to each element in the sequence, and a spatio-temporal grouping self-attention (STGA) module captures the spatial and temporal relationships between joints within each group. Decoupling the spatial and temporal dimensions enables the extraction of semantic information with a distinct meaning in each dimension. Additionally, we propose a within-group spatial global regularization mechanism to learn more general spatial attention maps, and an inter-group feature aggregation (IGFA) module to better distinguish similar actions. Our method outperforms state-of-the-art methods on two large-scale datasets in both recognition accuracy and computational efficiency.
Pages: 5733-5745
Number of pages: 13
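
The record describes DSTGFormer only at a high level, so the following is a minimal PyTorch sketch of the decoupled grouping idea rather than the authors' implementation: the class name DecoupledGroupAttention, the tensor layout, the group size G, and the ordering of the spatial and temporal attention passes are all illustrative assumptions, and the STPE, within-group regularization, and IGFA components are omitted.

```python
# Minimal sketch of decoupled spatio-temporal grouping attention, assuming
# an input layout of (batch, frames, joints, channels). All names and shapes
# are hypothetical; the paper's exact STGA design is not given in this record.
import torch
import torch.nn as nn


class DecoupledGroupAttention(nn.Module):
    """Attention inside one spatio-temporal group, decoupled per dimension:
    spatial attention mixes joints within each frame, temporal attention
    mixes frames within each group (one plausible reading of STGA)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * num_groups, G frames, V joints, dim)
        b, g, v, c = x.shape

        # Spatial pass: joints attend to joints within the same frame.
        s = x.reshape(b * g, v, c)
        n = self.norm1(s)
        s = s + self.spatial_attn(n, n, n)[0]
        x = s.reshape(b, g, v, c)

        # Temporal pass: frames attend to frames for the same joint.
        t = x.permute(0, 2, 1, 3).reshape(b * v, g, c)
        n = self.norm2(t)
        t = t + self.temporal_attn(n, n, n)[0]
        return t.reshape(b, v, g, c).permute(0, 2, 1, 3)


# Usage: split a skeleton sequence of T frames into groups of G consecutive
# frames, then apply decoupled attention inside each group.
batch, T, V, dim, G = 2, 64, 25, 128, 8            # e.g. 25 NTU-style joints
x = torch.randn(batch, T, V, dim)
groups = x.reshape(batch * (T // G), G, V, dim)    # spatio-temporal groups
out = DecoupledGroupAttention(dim)(groups)
print(out.shape)                                   # torch.Size([16, 8, 25, 128])
```

In this sketch, attending over V joints and G frames separately costs O(V^2 + G^2) per group instead of the O((GV)^2) of joint attention over all tokens in a group, which is one plausible reading of the abstract's efficiency claim.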