Hierarchically Self-supervised Transformer for Human Skeleton Representation Learning

Cited by: 20
Authors
Chen, Yuxiao [1 ]
Zhao, Long [2 ]
Yuan, Jianbo [3 ]
Tian, Yu [3 ]
Xia, Zhaoyang [1 ]
Geng, Shijie [1 ]
Han, Ligong [1 ]
Metaxas, Dimitris N. [1 ]
Affiliations
[1] Rutgers State Univ, Piscataway, NJ 08854 USA
[2] Google Res, Los Angeles, CA USA
[3] ByteDance Inc, Seattle, WA USA
Source
COMPUTER VISION, ECCV 2022, PT XXVI | 2022, Vol. 13686
Keywords
Skeleton representation learning; Self-supervised learning; Action recognition; Action detection; Motion prediction
DOI
10.1007/978-3-031-19809-0_11
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Despite the success of fully-supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning has become an active research area because acquiring task-specific skeleton annotations at large scale is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. Different from such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS) to explicitly capture spatial, short-term temporal, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments on three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage has strong transfer capability across different downstream tasks. The source code can be found at https://github.com/yuxiaochen1103/Hi-TRS.
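The following is a minimal PyTorch sketch of the three-level (frame / clip / video) hierarchical encoding idea summarized in the abstract. It assumes a standard Transformer encoder at each level, non-overlapping clips, and mean-pooling between levels; the class and parameter names (HierarchicalSkeletonEncoder, clip_len, etc.) are illustrative assumptions and not the authors' implementation, which is available at the repository linked above.

# Sketch of a three-level hierarchical Transformer encoder for skeleton sequences.
# Dimensions, pooling, and module names are illustrative assumptions only.
import torch
import torch.nn as nn


def encoder(dim, heads, layers):
    # Stack of standard Transformer encoder layers (batch-first).
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)


class HierarchicalSkeletonEncoder(nn.Module):
    def __init__(self, num_joints=25, coord_dim=3, dim=64, clip_len=4):
        super().__init__()
        self.clip_len = clip_len
        self.joint_embed = nn.Linear(coord_dim, dim)           # per-joint embedding
        self.frame_encoder = encoder(dim, heads=4, layers=2)   # spatial: joints within a frame
        self.clip_encoder = encoder(dim, heads=4, layers=2)    # short-term: frames within a clip
        self.video_encoder = encoder(dim, heads=4, layers=2)   # long-term: clips within a video

    def forward(self, x):
        # x: (batch, frames, joints, coords); frames must be divisible by clip_len here.
        b, t, j, _ = x.shape
        tokens = self.joint_embed(x)                            # (b, t, j, dim)

        # Frame level: attention over joints, then mean-pool to one token per frame.
        frames = self.frame_encoder(tokens.reshape(b * t, j, -1)).mean(dim=1)
        frames = frames.reshape(b, t, -1)                       # (b, t, dim)

        # Clip level: attention over frames inside each non-overlapping clip.
        n_clips = t // self.clip_len
        clips = frames.reshape(b * n_clips, self.clip_len, -1)
        clips = self.clip_encoder(clips).mean(dim=1).reshape(b, n_clips, -1)

        # Video level: attention over clip tokens yields the sequence representation.
        return self.video_encoder(clips).mean(dim=1)            # (b, dim)


if __name__ == "__main__":
    model = HierarchicalSkeletonEncoder()
    out = model(torch.randn(2, 16, 25, 3))   # 2 sequences, 16 frames, 25 joints, 3D coords
    print(out.shape)                         # torch.Size([2, 64])

In such a scheme, the three encoders expose token streams at each granularity, which is what allows the pre-training objectives described in the paper to supervise spatial, short-term, and long-term structure separately.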
Pages: 185-202
Page count: 18
Related Papers
50 records in total
  • [1] Self-Supervised Action Representation Learning Based on Asymmetric Skeleton Data Augmentation
    Zhou, Hualing
    Li, Xi
    Xu, Dahong
    Liu, Hong
    Guo, Jianping
    Zhang, Yihan
    SENSORS, 2022, 22 (22)
  • [2] Video Motion Perception for Self-supervised Representation Learning
    Li, Wei
    Luo, Dezhao
    Fang, Bo
    Li, Xiaoni
    Zhou, Yu
    Wang, Weiping
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT IV, 2022, 13532 : 508 - 520
  • [3] Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation
    Luo, Jian
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    INTERSPEECH 2021, 2021, : 1169 - 1173
  • [4] Self-supervised representation learning using multimodal Transformer for emotion recognition
    Goetz, Theresa
    Arora, Pulkit
    Erick, F. X.
    Holzer, Nina
    Sawant, Shrutika
    PROCEEDINGS OF THE 8TH INTERNATIONAL WORKSHOP ON SENSOR-BASED ACTIVITY RECOGNITION AND ARTIFICIAL INTELLIGENCE, IWOAR 2023, 2023
  • [5] Self-supervised video representation learning by maximizing mutual information
    Xue, Fei
    Ji, Hongbing
    Zhang, Wenbo
    Cao, Yi
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2020, 88
  • [6] Collaboratively Self-Supervised Video Representation Learning for Action Recognition
    Zhang, Jie
    Wan, Zhifan
    Hu, Lanqing
    Lin, Stephen
    Wu, Shuzhe
    Shan, Shiguang
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2025, 20 : 1895 - 1907
  • [7] Self-Supervised Video Representation Learning by Video Incoherence Detection
    Cao, Haozhi
    Xu, Yuecong
    Mao, Kezhi
    Xie, Lihua
    Yin, Jianxiong
    See, Simon
    Xu, Qianwen
    Yang, Jianfei
    IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (06) : 3810 - 3822
  • [8] Self-Supervised Time Series Representation Learning via Cross Reconstruction Transformer
    Zhang, Wenrui
    Yang, Ling
    Geng, Shijia
    Hong, Shenda
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (11) : 16129 - 16138
  • [9] EMS2L: Enhanced Multi-Task Self-Supervised Learning for 3D Skeleton Representation Learning
    Lin, Lilang
    Liu, Jiaying
    APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, 2023, 12 (04)
  • [10] Self-Supervised Video Representation Learning by Serial Restoration With Elastic Complexity
    Chen, Ziyu
    Wang, Hanli
    Chen, Chang Wen
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2235 - 2248