Global-local contrastive multiview representation learning for skeleton-based action

Cited by: 5
Authors
Bian, Cunling [1]
Feng, Wei [1]
Meng, Fanbo [2]
Wang, Song [3]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Sch Comp Sci & Technol, Tianjin 300350, Peoples R China
[2] Tianjin Univ, Inst Int Engn, Tianjin 300350, Peoples R China
[3] Univ South Carolina, Dept Comp Sci & Engn, Columbia, SC 29208 USA
Funding
National Natural Science Foundation of China;
Keywords
Skeleton-based action recognition; Contrastive representation learning; Multiview; Graph convolutional network; DEEPER;
DOI
10.1016/j.cviu.2023.103655
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Skeleton-based human action recognition has drawn increasing interest recently because it is largely insensitive to appearance changes and because skeleton data has become more widely available. However, the skeletons captured in practice are sensitive to the viewpoint from which an actor is observed, owing to the occlusion of different human-body joints and to errors in joint localization. Each view is noisy and incomplete, but important factors, such as motion and semantics, should be shared across all views in action representation learning. We adopt the classic hypothesis that a powerful representation is one that models view-invariant factors, and we expect this to hold for unsupervised learning as well. We therefore study this hypothesis under the framework of contrastive multiview learning, in which we learn a representation for action recognition that aims to maximize the mutual information between different views of the same action sequence. In addition, a global-local contrastive loss is proposed to model the multi-scale co-occurrence relationships in both the spatial and temporal domains. Extensive experimental results show that the proposed method significantly boosts the performance of unsupervised skeleton-based human action recognition methods on three challenging benchmarks: PKUMMD, NTU RGB+D 60, and NTU RGB+D 120.
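The contrastive multiview objective described in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style loss, a commonly used lower bound on mutual information; the record does not state which estimator the authors use, and the global-local loss is not reproduced here. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss over a batch (illustrative, not the paper's exact loss).

    view_a[i] and view_b[i] are embeddings of the same action sequence seen
    from two views (the positive pair); every other row of view_b acts as a
    negative for view_a[i].
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Similarity of the anchor to every embedding in the other view.
        logits = [
            sum(x * y for x, y in zip(anchor, other)) / temperature
            for other in b
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings: two sequences, each embedded from two views.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # view 2 agrees with view 1
swapped = [[0.1, 0.9], [0.9, 0.1]]   # view 2 is paired with the wrong sequence

loss_aligned = info_nce(anchors, aligned)
loss_swapped = info_nce(anchors, swapped)
```

Minimizing this loss pulls embeddings of the same sequence together across views and pushes different sequences apart, so `loss_aligned` comes out lower than `loss_swapped` in the toy data above.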
Pages: 10