Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding

Cited by: 2
Authors
Sun, Shengkai [1 ]
Liu, Daizong [2 ]
Dong, Jianfeng [3 ]
Qu, Xiaoye [4 ]
Gao, Junyu [5 ]
Yang, Xun [6 ]
Wang, Xun [3 ]
Wang, Meng [7 ]
Affiliations
[1] Zhejiang Gongshang Univ, Hangzhou, Peoples R China
[2] Peking Univ, Beijing, Peoples R China
[3] Zhejiang Gongshang Univ, Zhejiang Key Lab E Commerce, Hangzhou, Peoples R China
[4] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[5] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[6] Univ Sci & Technol China, Hefei, Peoples R China
[7] Hefei Univ Technol, Hefei, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multi-modal Learning; Unsupervised Representation Learning; Action Understanding
DOI
10.1145/3581783.3612449
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Unsupervised pre-training has recently shown great success in skeleton-based action understanding. Existing works typically train separate modality-specific models (i.e., joint, bone, and motion) and then integrate the multi-modal information for action understanding via a late-fusion strategy. Although these approaches achieve significant performance, they suffer from complex yet redundant multi-stream model designs, each of which is also limited to a fixed input skeleton modality. To alleviate these issues, in this paper we propose a Unified Multi-modal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. Specifically, instead of designing separate modality-specific optimization processes for uni-modal unsupervised learning, we feed the different modality inputs into the same stream with an early-fusion strategy to learn their multi-modal features, reducing model complexity. To ensure that the fused multi-modal features do not exhibit modality bias, i.e., are not dominated by a certain modality input, we further propose both intra- and inter-modal consistency learning, which guarantees via feature decomposition and distinct alignment that the multi-modal features contain the complete semantics of each modality. In this manner, our framework learns a unified representation for uni-modal or multi-modal skeleton input, making it flexible to different kinds of modality input for robust action understanding in practical cases. Extensive experiments conducted on three large-scale datasets, i.e., NTU-60, NTU-120, and PKU-MMD II, demonstrate that UmURL is highly efficient, with complexity comparable to uni-modal methods, while achieving new state-of-the-art performance across various downstream task scenarios in skeleton-based action representation learning. Our source code is available at https://github.com/HuiGuanLab/UmURL.
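To make the early-fusion idea concrete, the following is a minimal PyTorch sketch of a single shared stream over the three skeleton modalities. The bone and motion derivations follow common practice in skeleton-based action recognition; everything else (the names derive_modalities and EarlyFusionEncoder, the sum-based fusion, and all dimensions) is an illustrative assumption, not the authors' implementation, which is available at the repository linked above.

```python
# Hypothetical sketch of single-stream early fusion for skeleton modalities.
# Module names, shapes, and the sum-based fusion are illustrative assumptions;
# they are not taken from the UmURL paper or its repository.
import torch
import torch.nn as nn

def derive_modalities(joint, parents):
    """joint: (B, T, V, C) joint coordinates; parents: length-V list mapping
    each joint index to its parent. Returns the three modalities commonly
    derived from raw skeleton sequences: joint, bone, and motion."""
    bone = joint - joint[:, :, parents, :]        # parent-to-child bone vectors
    motion = torch.zeros_like(joint)
    motion[:, 1:] = joint[:, 1:] - joint[:, :-1]  # frame-to-frame displacement
    return joint, bone, motion

class EarlyFusionEncoder(nn.Module):
    """One shared stream: per-modality linear embeddings are fused (here by
    summation) *before* a single transformer encoder, rather than running
    three modality-specific networks and late-fusing their outputs."""
    def __init__(self, num_joints=25, in_dim=3, d_model=256, depth=4):
        super().__init__()
        # one lightweight embedding per modality (joint / bone / motion)
        self.embeds = nn.ModuleList(
            [nn.Linear(num_joints * in_dim, d_model) for _ in range(3)]
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, modalities):
        # modalities: list of (B, T, V, C) tensors in the same order as the
        # embeddings; a prefix subset also works, which is what makes the
        # single stream flexible to the kind of modality input it receives.
        B, T = modalities[0].shape[:2]
        tokens = sum(emb(m.reshape(B, T, -1))
                     for emb, m in zip(self.embeds, modalities))  # early fusion
        feats = self.encoder(tokens)                  # one shared encoder pass
        return feats.mean(dim=1)                      # clip-level representation

# Usage: encode a batch of 64-frame, 25-joint skeleton clips.
parents = [0] + list(range(24))    # placeholder topology; use the real skeleton graph
joint = torch.randn(8, 64, 25, 3)  # batch of 8 clips
j, b, m = derive_modalities(joint, parents)
z = EarlyFusionEncoder()([j, b, m])  # (8, 256) unified representation
```

The intra- and inter-modal consistency objectives described above, which decompose and align the fused features so that no single modality dominates, would be imposed on top of this representation during pre-training; they are omitted here for brevity.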
Pages: 2973-2984
Page count: 12
Related Papers
50 records in total
  • [1] Hierarchical Contrast for Unsupervised Skeleton-Based Action Representation Learning
    Dong, Jianfeng
    Sun, Shengkai
    Liu, Zhonglin
    Chen, Shujie
    Liu, Baolong
    Wang, Xun
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37, NO 1, 2023: 525 - 533
  • [2] Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition
    Lin, Lilang
    Wu, Lehong
    Zhang, Jiahang
    Liu, Jiaying
    COMPUTER VISION - ECCV 2024, PT XXVI, 2025, 15084: 75 - 92
  • [3] Representation modeling learning with multi-domain decoupling for unsupervised skeleton-based action recognition
    He, Zhiquan
    Lv, Jiantu
    Fang, Shizhang
    NEUROCOMPUTING, 2024, 582
  • [4] Unsupervised skeleton-based action representation learning via relation consistency pursuit
    Zhang, Wenjing
    Hou, Yonghong
    Zhang, Haoyuan
    NEURAL COMPUTING & APPLICATIONS, 2022, 34 (22): 20327 - 20339
  • [5] EnsCLR: Unsupervised skeleton-based action recognition via ensemble contrastive learning of representation
    Wang, Kun
    Cao, Jiuxin
    Cao, Biwei
    Liu, Bo
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 247
  • [6] Fast Multi-Modal Unified Sparse Representation Learning
    Verma, Mridula
    Shukla, Kaushal Kumar
    PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR'17), 2017: 448 - 452
  • [7] Bootstrapped Representation Learning for Skeleton-Based Action Recognition
    Moliner, Olivier
    Huang, Sangxia
    Astrom, Kalle
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022: 4153 - 4163
  • [8] Progressive semantic learning for unsupervised skeleton-based action recognition
    Qin, Hao
    Chen, Luyuan
    Kong, Ming
    Zhao, Zhuoran
    Zeng, Xianzhou
    Lu, Mengxu
    Zhu, Qiang
    MACHINE LEARNING, 2025, 114 (03)
  • [9] Unsupervised Multi-modal Learning
    Iqbal, Mohammed Shameer
    ADVANCES IN ARTIFICIAL INTELLIGENCE (AI 2015), 2015, 9091 : 343 - 346