Hierarchically Self-supervised Transformer for Human Skeleton Representation Learning

Cited by: 20
Authors
Chen, Yuxiao [1 ]
Zhao, Long [2 ]
Yuan, Jianbo [3 ]
Tian, Yu [3 ]
Xia, Zhaoyang [1 ]
Geng, Shijie [1 ]
Han, Ligong [1 ]
Metaxas, Dimitris N. [1 ]
Affiliations
[1] Rutgers State Univ, Piscataway, NJ 08854 USA
[2] Google Res, Los Angeles, CA USA
[3] ByteDance Inc, Seattle, WA USA
Source
COMPUTER VISION, ECCV 2022, PT XXVI | 2022, Vol. 13686
Keywords
Skeleton representation learning; Self-supervised learning; Action recognition; Action detection; Motion prediction
DOI
10.1007/978-3-031-19809-0_11
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Despite the success of fully-supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning has become an active research area because acquiring task-specific skeleton annotations at large scale is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. Different from such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS) to explicitly capture spatial, short-term temporal, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments on three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage has strong transfer capability across different downstream tasks. The source code can be found at https://github.com/yuxiaochen1103/Hi-TRS.
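The following is a minimal PyTorch sketch of the three-level (frame / clip / video) hierarchical encoding idea summarized in the abstract. It assumes a standard Transformer encoder at each level, non-overlapping clips, and mean-pooling between levels; the class and parameter names (HierarchicalSkeletonEncoder, clip_len, etc.) are illustrative assumptions and not the authors' implementation, which is available at the repository linked above.

# Sketch of a three-level hierarchical Transformer encoder for skeleton sequences.
# Dimensions, pooling, and module names are illustrative assumptions only.
import torch
import torch.nn as nn


def encoder(dim, heads, layers):
    # Stack of standard Transformer encoder layers (batch-first).
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)


class HierarchicalSkeletonEncoder(nn.Module):
    def __init__(self, num_joints=25, coord_dim=3, dim=64, clip_len=4):
        super().__init__()
        self.clip_len = clip_len
        self.joint_embed = nn.Linear(coord_dim, dim)           # per-joint embedding
        self.frame_encoder = encoder(dim, heads=4, layers=2)   # spatial: joints within a frame
        self.clip_encoder = encoder(dim, heads=4, layers=2)    # short-term: frames within a clip
        self.video_encoder = encoder(dim, heads=4, layers=2)   # long-term: clips within a video

    def forward(self, x):
        # x: (batch, frames, joints, coords); frames must be divisible by clip_len here.
        b, t, j, _ = x.shape
        tokens = self.joint_embed(x)                            # (b, t, j, dim)

        # Frame level: attention over joints, then mean-pool to one token per frame.
        frames = self.frame_encoder(tokens.reshape(b * t, j, -1)).mean(dim=1)
        frames = frames.reshape(b, t, -1)                       # (b, t, dim)

        # Clip level: attention over frames inside each non-overlapping clip.
        n_clips = t // self.clip_len
        clips = frames.reshape(b * n_clips, self.clip_len, -1)
        clips = self.clip_encoder(clips).mean(dim=1).reshape(b, n_clips, -1)

        # Video level: attention over clip tokens yields the sequence representation.
        return self.video_encoder(clips).mean(dim=1)            # (b, dim)


if __name__ == "__main__":
    model = HierarchicalSkeletonEncoder()
    out = model(torch.randn(2, 16, 25, 3))   # 2 sequences, 16 frames, 25 joints, 3D coords
    print(out.shape)                         # torch.Size([2, 64])

In such a scheme, the three encoders expose token streams at each granularity, which is what allows the pre-training objectives described in the paper to supervise spatial, short-term, and long-term structure separately.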
Pages: 185-202
Page count: 18
Related Papers
50 records in total
  • [1] Self-Supervised Action Representation Learning Based on Asymmetric Skeleton Data Augmentation
    Zhou, Hualing
    Li, Xi
    Xu, Dahong
    Liu, Hong
    Guo, Jianping
    Zhang, Yihan
    SENSORS, 2022, 22 (22)
  • [2] Video Motion Perception for Self-supervised Representation Learning
    Li, Wei
    Luo, Dezhao
    Fang, Bo
    Li, Xiaoni
    Zhou, Yu
    Wang, Weiping
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT IV, 2022, 13532 : 508 - 520
  • [3] Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation
    Luo, Jian
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    INTERSPEECH 2021, 2021, : 1169 - 1173
  • [4] Self-supervised representation learning using multimodal Transformer for emotion recognition
    Goetz, Theresa
    Arora, Pulkit
    Erick, F. X.
    Holzer, Nina
    Sawant, Shrutika
    PROCEEDINGS OF THE 8TH INTERNATIONAL WORKSHOP ON SENSOR-BASED ACTIVITY RECOGNITION AND ARTIFICIAL INTELLIGENCE, IWOAR 2023, 2023
  • [5] Self-supervised video representation learning by maximizing mutual information
    Xue, Fei
    Ji, Hongbing
    Zhang, Wenbo
    Cao, Yi
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2020, 88
  • [6] Collaboratively Self-Supervised Video Representation Learning for Action Recognition
    Zhang, Jie
    Wan, Zhifan
    Hu, Lanqing
    Lin, Stephen
    Wu, Shuzhe
    Shan, Shiguang
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2025, 20 : 1895 - 1907
  • [7] Self-Supervised Video Representation Learning by Video Incoherence Detection
    Cao, Haozhi
    Xu, Yuecong
    Mao, Kezhi
    Xie, Lihua
    Yin, Jianxiong
    See, Simon
    Xu, Qianwen
    Yang, Jianfei
    IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (06) : 3810 - 3822
  • [8] Self-Supervised Time Series Representation Learning via Cross Reconstruction Transformer
    Zhang, Wenrui
    Yang, Ling
    Geng, Shijia
    Hong, Shenda
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (11) : 16129 - 16138
  • [9] EMS2L: Enhanced Multi-Task Self-Supervised Learning for 3D Skeleton Representation Learning
    Lin, Lilang
    Liu, Jiaying
    APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, 2023, 12 (04)
  • [10] Self-Supervised Video Representation Learning by Serial Restoration With Elastic Complexity
    Chen, Ziyu
    Wang, Hanli
    Chen, Chang Wen
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2235 - 2248