ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding

Cited by: 107
Authors
Xue, Le [1]
Gao, Mingfei [1]
Xing, Chen [1]
Martin-Martin, Roberto [1,2]
Wu, Jiajun [3]
Xiong, Caiming [1]
Xu, Ran [1]
Niebles, Juan Carlos [1]
Savarese, Silvio [1]
Affiliations
[1] Salesforce Res, Palo Alto, CA 94105 USA
[2] UT Austin, Austin, UT USA
[3] Stanford Univ, Stanford, CA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.00120
Chinese Library Classification
TP18 (Artificial Intelligence Theory)
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The recognition capabilities of current state-of-the-art 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories. In the 2D domain, recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language. Inspired by this, leveraging multimodal information for the 3D modality is a promising way to improve 3D understanding under a restricted data regime, but this line of research is not well studied. Therefore, we introduce ULIP, which learns a unified representation of images, text, and 3D point clouds by pre-training with object triplets from the three modalities. To overcome the shortage of training triplets, ULIP leverages a pre-trained vision-language model that has already learned a common visual and textual space by training on massive image-text pairs. ULIP then learns a 3D representation space aligned with this common image-text space, using a small number of automatically synthesized triplets. ULIP is agnostic to the 3D backbone network and can easily be integrated into any 3D architecture. Experiments show that ULIP effectively improves the performance of multiple recent 3D backbones simply by pre-training them on ShapeNet55 with our framework, achieving state-of-the-art performance in both standard 3D classification and zero-shot 3D classification on ModelNet40 and ScanObjectNN. ULIP improves the performance of PointMLP by around 3% in 3D classification on ScanObjectNN, and outperforms PointCLIP by 28.8% in top-1 accuracy for zero-shot 3D classification on ModelNet40. Our code and pre-trained models will be released.
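To make the described pre-training objective concrete, the following is a minimal PyTorch sketch of the cross-modal contrastive alignment the abstract outlines: a trainable 3D encoder is pulled toward the frozen image-text embedding space of a pre-trained vision-language model such as CLIP. This is not the authors' released implementation; the names point_encoder, clip_image_feats, clip_text_feats, and the temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def contrastive_loss(feat_a, feat_b, temperature=0.07):
    # Symmetric InfoNCE-style loss between two batches of embeddings,
    # matching the i-th sample of one modality to the i-th of the other.
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)
    logits = feat_a @ feat_b.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(feat_a.size(0), device=feat_a.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def ulip_pretrain_step(point_encoder, point_clouds, clip_image_feats, clip_text_feats):
    # Only the 3D encoder receives gradients; the image and text features are
    # precomputed by a frozen vision-language model (hypothetical setup).
    point_feats = point_encoder(point_clouds)  # (B, D) 3D embeddings
    return (contrastive_loss(point_feats, clip_image_feats)
            + contrastive_loss(point_feats, clip_text_feats))

Because the image-text space is kept frozen, the synthesized triplets only need to teach the 3D encoder where each object lands in that space, which is consistent with a relatively small pre-training set such as ShapeNet55 sufficing.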
Pages: 1179-1189
Number of pages: 11