Point-LGMask: Local and Global Contexts Embedding for Point Cloud Pre-Training With Multi-Ratio Masking

Cited by: 8
Authors
Tang, Yuan [1]
Li, Xianzhi [1]
Xu, Jinfeng [1]
Yu, Qiao [1]
Hu, Long [1]
Hao, Yixue [1]
Chen, Min [2, 3]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan 430074, Peoples R China
[2] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou 510640, Peoples R China
[3] Pazhou Lab, Guangzhou 510640, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Point cloud compression; Task analysis; Three-dimensional displays; Predictive models; Self-supervised learning; Representation learning; Context modeling; Local and global contexts embedding; self-supervised learning; point cloud understanding; representation learning;
DOI
10.1109/TMM.2023.3282568
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Self-supervised learning has achieved great success in both natural language processing and 2D vision, where masked modeling is a popular pre-training scheme. However, extending masking to 3D point cloud understanding, which combines local and global features, poses a new challenge. In our work, we present Point-LGMask, a novel method that embeds both local and global contexts with multi-ratio masking, an approach that is quite effective for self-supervised feature learning of point clouds but has been overlooked by existing pre-training works. Specifically, to avoid fitting to a fixed masking ratio, we first propose multi-ratio masking, which prompts the encoder to fully explore representative features through tasks of different difficulties. Next, to encourage the embedding of both local and global features, we formulate a compound loss, which consists of (i) a global representation contrastive loss that encourages the cluster assignments of the masked point clouds to be consistent with those of the complete input, and (ii) a local point cloud prediction loss that encourages accurate prediction of the masked points. Equipped with Point-LGMask, we show that our learned representations transfer well to various downstream tasks, including few-shot classification, shape classification, object part segmentation, and real-world scene-based 3D object detection and 3D semantic segmentation. In particular, on the difficult few-shot classification task using the real-captured ScanObjectNN dataset, our model surpasses the second-best pre-training method by over 4%. Point-LGMask also achieves gains of 0.4% AP(25) and 0.8% AP(50) over the second-best method on the 3D object detection task. For semantic segmentation, Point-LGMask surpasses the second-best method by 0.4% mAcc and 0.5% mIoU.
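The two ideas named in the abstract, multi-ratio masking and a compound loss, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the ratios (0.4, 0.6, 0.9), the patch count, uniform random patch selection, and the weighted-sum form of the loss are all assumptions made for illustration, since the abstract does not specify them.

```python
import random


def multi_ratio_masks(num_patches, ratios, seed=0):
    """Build one boolean mask per masking ratio (True = patch is masked).

    Hypothetical helper: patches are chosen uniformly at random, which is
    an assumption, not a detail given in the abstract.
    """
    rng = random.Random(seed)
    masks = []
    for ratio in ratios:
        num_masked = round(ratio * num_patches)
        masked = set(rng.sample(range(num_patches), num_masked))
        masks.append([i in masked for i in range(num_patches)])
    return masks


def compound_loss(global_contrastive_loss, local_prediction_loss, weight=1.0):
    """Combine the global and local terms.

    A weighted sum is assumed here; the abstract only says the compound
    loss consists of these two terms.
    """
    return global_contrastive_loss + weight * local_prediction_loss


# Each ratio yields a pre-training task of different difficulty.
masks = multi_ratio_masks(num_patches=64, ratios=(0.4, 0.6, 0.9))
masked_counts = [sum(mask) for mask in masks]
total = compound_loss(0.5, 0.25, weight=2.0)
```

With 64 patches, the three assumed ratios mask 26, 38, and 58 patches respectively, so the same encoder sees easy and hard reconstruction tasks for one input.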
Pages: 8360-8370
Page count: 11