JM3D & JM3D-LLM: Elevating 3D Representation With Joint Multi-Modal Cues

Cited by: 1
Authors
Ji, Jiayi [1 ,2 ]
Wang, Haowei [3 ]
Wu, Changli [1 ]
Ma, Yiwei [1 ]
Sun, Xiaoshuai [1 ]
Ji, Rongrong [1 ]
Affiliations
[1] Xiamen Univ, Key Lab Multimedia Trusted Percept & Efficient Com, Minist Educ China, Xiamen 361005, Peoples R China
[2] Natl Univ Singapore, Singapore 119077, Singapore
[3] Tencent, Youtu Lab, Shanghai 200000, Peoples R China
Funding
China Postdoctoral Science Foundation; National Key R&D Program of China; National Natural Science Foundation of China
Keywords
Three-dimensional displays; Solid modeling; Point cloud compression; Visualization; Representation learning; Feature extraction; Large language models; Data models; Degradation; Contrastive learning; 3D representation learning; joint multi-modal alignment; large language model; structured multimodal organizer;
DOI
10.1109/TPAMI.2024.3523675
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
3D representation learning is of rising importance in computer vision, autonomous driving, and robotics. However, the prevailing trend of directly transferring 2D alignment strategies to the 3D domain encounters three distinct challenges: (1) Information Degradation: 3D data are aligned with only single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: These strategies align 3D representations to image and text features individually, hampering the overall optimization of 3D models. (3) Underutilization: The fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss of detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), which enriches the vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries the 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority, and the strong performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach.
Pages: 2475-2492
Page count: 18
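As a concrete illustration of the joint alignment idea described in the abstract, below is a minimal PyTorch-style sketch of a joint multi-modal contrastive loss. It assumes hypothetical, pre-computed point-cloud, image, and text embeddings of the same dimension (`point_feat`, `image_feat`, `text_feat`); the fusion weight `alpha` and temperature `tau` are illustrative placeholders, not the paper's reported settings or released implementation.

```python
# Minimal sketch of joint multi-modal contrastive alignment (JMA-style idea).
# Assumes batched, same-dimensional embeddings from hypothetical encoders;
# not the authors' released code.
import torch
import torch.nn.functional as F

def joint_alignment_loss(point_feat, image_feat, text_feat, tau=0.07, alpha=0.5):
    """Contrast 3D features against a fused image-text target instead of
    aligning to each modality independently."""
    # L2-normalize all modality embeddings: shape (B, D)
    p = F.normalize(point_feat, dim=-1)
    v = F.normalize(image_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)

    # Joint target: convex combination of visual and textual cues,
    # re-normalized onto the unit sphere.
    joint = F.normalize(alpha * v + (1 - alpha) * t, dim=-1)

    # Symmetric InfoNCE between point-cloud and joint embeddings.
    logits = p @ joint.T / tau                      # (B, B) similarity matrix
    labels = torch.arange(p.size(0), device=p.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```

The design point this sketch captures is that the 3D features are contrasted against a single fused image-text target rather than against each modality separately, so the gradient reaching the 3D encoder reflects both cues jointly.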