JM3D & JM3D-LLM: Elevating 3D Representation With Joint Multi-Modal Cues

Cited by: 1

Authors:

Ji, Jiayi [1, 2]

Wang, Haowei [3]

Wu, Changli [1]

Ma, Yiwei [1]

Sun, Xiaoshuai [1]

Ji, Rongrong [1]

Affiliations:

[1] Xiamen Univ, Key Lab Multimedia Trusted Percept & Efficient Com, Minist Educ China, Xiamen 361005, Peoples R China

[2] Natl Univ Singapore, Singapore 119077, Singapore

[3] Tencent, Youtu Lab, Shanghai 200000, Peoples R China

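Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2025, Vol. 47, No. 4

Funding:

China Postdoctoral Science Foundation; National Key Research and Development Program of China; National Natural Science Foundation of China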

Keywords:

Three-dimensional displays; Solid modeling; Point cloud compression; Visualization; Representation learning; Feature extraction; Large language models; Data models; Degradation; Contrastive learning; 3D representation learning; joint multi-modal alignment; large language model; structured multimodal organizer;

DOI:

10.1109/TPAMI.2024.3523675

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The rising importance of 3D representation learning, pivotal in computer vision, autonomous driving, and robotics, is evident. However, a prevailing trend, which straightforwardly resorted to transferring 2D alignment strategies to the 3D domain, encounters three distinct challenges: (1) Information Degradation: This arises from the alignment of 3D data with mere single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: These strategies align 3D representations to image and text features individually, hampering the overall optimization for 3D models. (3) Underutilization: The fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss in detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), enriching vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), combining language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority. The superior performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach.
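The contrast the abstract draws between aligning 3D features to image and text features individually and aligning them jointly can be illustrated with a toy contrastive loss. The sketch below is only an illustration under stated assumptions, not the authors' formulation: the InfoNCE form, the temperature value, the helper names (`info_nce`, `joint_target`), the averaging used to fuse image and text features, and the toy vectors are all hypothetical and do not come from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss in which queries[i] should match keys[i]."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)

def joint_target(image_feats, text_feats):
    """Hypothetical fusion: average the normalized image and text features."""
    fused = []
    for img, txt in zip(image_feats, text_feats):
        img_n, txt_n = normalize(img), normalize(txt)
        fused.append([(a + b) / 2 for a, b in zip(img_n, txt_n)])
    return fused

# Toy features for three objects (made-up numbers, for illustration only).
points = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
images = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
texts = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.2, 0.0, 1.0]]

# Individual alignment: two separate objectives, one per modality.
separate_loss = info_nce(points, images) + info_nce(points, texts)

# Joint alignment: one objective against a fused image-text target.
joint_loss = info_nce(points, joint_target(images, texts))

print(separate_loss, joint_loss)
```

In the individual variant the 3D encoder receives two independent objectives, while in the joint variant it is optimized against a single target that already combines both modalities. How JM3D actually builds that joint target is described in the paper itself and may differ from the simple averaging assumed here.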

Citation

Pages: 2475-2492

Page count: 18
