Mutual Information Driven Equivariant Contrastive Learning for 3D Action Representation Learning

Cited by: 0
Authors
Lin, Lilang [1 ]
Zhang, Jiahang [1 ]
Liu, Jiaying [1 ]
Affiliations
[1] Peking Univ, Wangxuan Inst Comp Technol, Beijing 100080, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Self-supervised learning; Skeleton; Task analysis; Representation learning; Data models; Three-dimensional displays; Convolutional neural networks; skeleton-based action recognition; contrastive learning; LSTM;
DOI
10.1109/TIP.2024.3372451
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Self-supervised contrastive learning has proven to be successful for skeleton-based action recognition. For contrastive learning, data transformations are found to fundamentally affect the learned representation quality. However, traditional invariant contrastive learning is detrimental to the performance on the downstream task if the transformation carries important information for the task. In this sense, it limits the application of many data transformations in the current contrastive learning pipeline. To address these issues, we propose to utilize equivariant contrastive learning, which extends invariant contrastive learning and preserves important information. By integrating equivariant and invariant contrastive learning into a hybrid approach, the model can better leverage the motion patterns exposed by data transformations and obtain a more discriminative representation space. Specifically, a self-distillation loss is first proposed for transformed data of different intensities to fully utilize invariant transformations, especially strong invariant transformations. For equivariant transformations, we explore the potential of skeleton mixing and temporal shuffling for equivariant contrastive learning. Meanwhile, we analyze the impacts of different data transformations on the feature space in terms of two novel metrics proposed in this paper, namely, consistency and diversity. In particular, we demonstrate that equivariant learning boosts performance by alleviating the dimensional collapse problem. Experimental results on several benchmarks indicate that our method outperforms existing state-of-the-art methods.
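The distinction the abstract draws can be made concrete with a minimal sketch (this is an illustration of the general idea, not the authors' implementation; all function names here are hypothetical). An invariant InfoNCE objective pulls embeddings of two views of the same sample together, which discards transformation information; an equivariant objective instead keeps a transformation such as temporal shuffling recoverable from the embedding, e.g. by additionally training a head to predict which permutation was applied.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Invariant contrastive (InfoNCE) loss: matching rows of z1 and z2
    are embeddings of two augmented views of the same sample."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature            # (N, N) similarity matrix
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))          # positives lie on the diagonal

def temporal_shuffle(seq, perm):
    """Equivariant transformation: split a skeleton sequence (T, J, C)
    into segments and reorder them; the permutation id stays recoverable."""
    segments = np.array_split(seq, len(perm), axis=0)
    return np.concatenate([segments[i] for i in perm], axis=0)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z)                        # views agree -> low loss
mismatched = info_nce(z, rng.normal(size=(8, 16)))
print(aligned < mismatched)                     # True: invariance is rewarded

seq = rng.normal(size=(12, 25, 3))              # 12 frames, 25 joints, xyz
shuffled = temporal_shuffle(seq, (2, 0, 1))
# An equivariant objective would also train a classifier head to predict
# the permutation id (2, 0, 1) from the embedding of `shuffled`, so that
# temporal-order information is preserved rather than collapsed away.
```

A purely invariant loss would treat `seq` and `shuffled` as the same positive pair, erasing motion order; the hybrid scheme described above applies the invariant loss only to order-preserving augmentations and the equivariant prediction task to order-destroying ones.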
Pages: 1883-1897
Page count: 15