Self-Supervised Intra-Modal and Cross-Modal Contrastive Learning for Point Cloud Understanding

Cited by: 25
Authors
Wu, Yue [1 ]
Liu, Jiaming [1 ]
Gong, Maoguo [2 ]
Gong, Peiran [1 ]
Fan, Xiaolong [2 ]
Qin, A. K. [3 ]
Miao, Qiguang [1 ]
Ma, Wenping [4 ]
Affiliations
[1] Xidian Univ, Sch Comp Sci & Technol, Key Lab Collaborat Intelligence Syst, Minist Educ, Xian 710071, Peoples R China
[2] Xidian Univ, Sch Elect Engn, Key Lab Collaborat Intelligence Syst, Minist Educ, Xian 710071, Peoples R China
[3] Swinburne Univ Technol, Dept Comp Sci & Software Engn, Hawthorn, Vic 3122, Australia
[4] Xidian Univ, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Minist Educ, Xian 710071, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Point cloud compression; Three-dimensional displays; Task analysis; Feature extraction; Self-supervised learning; Image color analysis; Visualization; Self-supervision; cross-modal learning; joint; 3D-2D; point cloud understanding; NETWORK;
DOI
10.1109/TMM.2023.3284591
CLC Number
TP [Automation and Computer Technology];
Discipline Code
0812;
Abstract
Learning effective representations from unlabeled data is a challenging task for point cloud understanding. Since the human visual system can map concepts learned from 2D images to the 3D world, and inspired by recent multimodal research, we introduce data from the point cloud modality and the image modality for joint learning. Based on the properties of point clouds and images, we propose CrossNet, a comprehensive intra- and cross-modal contrastive learning method that learns 3D point cloud representations. The proposed method establishes 3D-3D and 3D-2D correspondence objectives by maximizing the consistency between point clouds and their augmented versions, and between point clouds and their corresponding rendered images, in an invariant space. We further distinguish the rendered images into RGB and grayscale images to extract color and geometric features, respectively. These training objectives exploit feature correspondences between modalities to combine rich learning signals from point clouds and images. Our CrossNet is simple: we add a feature extraction module and a projection head module to the point cloud and image branches, respectively, to train the backbone network in a self-supervised manner. After the network is pretrained, only the point cloud feature extraction module is required for fine-tuning and directly predicting results on downstream tasks. Our experiments on multiple benchmarks demonstrate improved point cloud classification and segmentation results, and the learned representations generalize across domains.
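The abstract describes combining an intra-modal objective (point cloud vs. its augmented view) with a cross-modal objective (point cloud vs. rendered image) via contrastive learning. A minimal sketch of such a combined loss, using a standard InfoNCE formulation over projected embeddings, is shown below; the function names, the weighting factor `alpha`, and the use of plain NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE contrastive loss: row i of z_a and row i of z_b are a
    positive pair; all other rows in the batch serve as negatives."""
    # Project embeddings onto the unit sphere so logits are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    # Log-softmax over each row; positives sit on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def joint_contrastive_loss(z_pc, z_pc_aug, z_img, alpha=0.5):
    """Weighted sum of an intra-modal term (3D-3D) and a
    cross-modal term (3D-2D), as the abstract outlines."""
    intra = info_nce(z_pc, z_pc_aug)   # point cloud vs. augmented point cloud
    cross = info_nce(z_pc, z_img)      # point cloud vs. rendered-image embedding
    return alpha * intra + (1 - alpha) * cross
```

When the paired embeddings are well aligned, the diagonal dominates each softmax row and the loss drops toward zero; mismatched pairs push the loss toward log(batch size), which is what drives the two branches toward a shared invariant space.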
Pages: 1626-1638
Page count: 13
Related Papers
50 records total
  • [1] CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding
    Afham, Mohamed
    Dissanayake, Isuru
    Dissanayake, Dinithi
    Dharmasiri, Amaya
    Thilakarathna, Kanchana
    Rodrigo, Ranga
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 9892 - 9902
  • [2] SCPNet: Unsupervised Cross-Modal Homography Estimation via Intra-modal Self-supervised Learning
    Zhang, Runmin
    Ma, Jun
    Cao, Si-Yuan
    Luo, Lun
    Yu, Beinan
    Chen, Shu-Jie
    Li, Junwei
    Shen, Hui-Liang
    COMPUTER VISION - ECCV 2024, PT XXIII, 2025, 15081 : 460 - 477
  • [3] Self-Supervised Correlation Learning for Cross-Modal Retrieval
    Liu, Yaxin
    Wu, Jianlong
    Qu, Leigang
    Gan, Tian
    Yin, Jianhua
    Nie, Liqiang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2851 - 2863
  • [4] Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search
    Liang, Meiyu
    Du, Junping
    Liang, Zhengyang
    Xing, Yongwang
    Huang, Wei
    Xue, Zhe
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 12, 2024, : 13744 - 13753
  • [5] RELATIONS BETWEEN INTRA-MODAL AND CROSS-MODAL MATCHING
    BJORKMAN, M
    SCANDINAVIAN JOURNAL OF PSYCHOLOGY, 1967, 8 (02) : 65 -
  • [6] Cross-modal Self-Supervised Learning for Lip Reading: When Contrastive Learning meets Adversarial Training
    Sheng, Changchong
    Pietikainen, Matti
    Tian, Qi
    Liu, Li
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2456 - 2464
  • [7] SELF-SUPERVISED LEARNING WITH CROSS-MODAL TRANSFORMERS FOR EMOTION RECOGNITION
    Khare, Aparna
    Parthasarathy, Srinivas
    Sundaram, Shiva
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 381 - 388
  • [8] COMPARISON OF CROSS-MODAL AND INTRA-MODAL FORM RECOGNITION IN CHILDREN WITH LEARNING DISABILITIES
    GAINES, BJ
    RASKIN, LM
    JOURNAL OF LEARNING DISABILITIES, 1970, 3 (05) : 243 - 246
  • [9] INTRA-MODAL AND CROSS-MODAL MATCHING IN ALZHEIMERS-DISEASE
    BUTTERS, MA
    RAPCSAK, SZ
    KASZNIAK, AW
    BONDI, MW
    JOURNAL OF CLINICAL AND EXPERIMENTAL NEUROPSYCHOLOGY, 1990, 12 (01) : 21 - 21
  • [10] Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning
    Das, Srijan
    Ryoo, Michael
    2023 18TH INTERNATIONAL CONFERENCE ON MACHINE VISION AND APPLICATIONS, MVA, 2023,