Fine-tuning 3D foundation models for geometric object retrieval

Cited by: 0
Authors
Van den Herrewegen, Jarne [1 ,2 ]
Tourwe, Tom [1 ]
Ovsjanikov, Maks [3 ]
Wyffels, Francis [2 ]
Affiliations
[1] Oqton AI, Edegem, Belgium
[2] Ghent Univ Imec, AI & Robot Lab, IDLab AIRO, Zwijnaarde, Belgium
[3] Ecole Polytech, LIX, Palaiseau, France
Source
COMPUTERS & GRAPHICS-UK | 2024, Vol. 122
Keywords
Object retrieval; Deep learning; 3D; Transfer learning; Foundation models; Self-supervised learning; NEURAL-NETWORK
DOI
10.1016/j.cag.2024.103993
Chinese Library Classification
TP31 [Computer software]
Discipline Codes
081202; 0835
Abstract
Foundation models, such as ULIP-2 (Xue et al., 2023), have recently advanced the field of 3D deep learning. These models are trained on significantly more data and show superior representation learning capacity in many downstream tasks, such as 3D shape classification and few-shot part segmentation. A particular characteristic of recent 3D foundation models is that they are typically multi-modal, involving image (2D) and caption (text) branches alongside the 3D branch. This leads to an intricate interplay that benefits all modalities. At the same time, the behavior of the 3D encoders within these foundation models, taken on their own, is not well understood. Specifically, there is little analysis of the utility of the pre-trained 3D features these models provide, or of their capacity to adapt to new downstream 3D data. Furthermore, existing studies typically focus on label-oriented downstream tasks, such as shape classification, and ignore other critical applications, such as 3D content-based object retrieval. In this paper, we fill this gap and show, for the first time, how 3D foundation models can be leveraged for strong 3D-to-3D retrieval performance on seven different datasets, on par with state-of-the-art view-based architectures. We evaluate both the pre-trained foundation models and their versions fine-tuned on downstream data. We compare supervised fine-tuning using classification labels against two self-supervised, label-free fine-tuning methods. Importantly, we introduce and describe a methodology for fine-tuning, as we found this to be crucial for making transfer learning from 3D foundation models work in a stable manner.
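The 3D-to-3D retrieval setting described in the abstract can be sketched as nearest-neighbor search over encoder embeddings: a (pre-trained or fine-tuned) 3D encoder maps each shape to a feature vector, and retrieval ranks the database by cosine similarity to the query embedding. The sketch below is illustrative only and is not the authors' code; the encoder is abstracted away, and the embeddings are random stand-ins.

```python
import numpy as np

def retrieve_top_k(query_emb, db_embs, k=5):
    """Rank database shape embeddings by cosine similarity to a query embedding.

    query_emb: (d,) vector produced by some 3D encoder (hypothetical here).
    db_embs:   (n, d) matrix of database shape embeddings.
    Returns the indices of the k most similar shapes and their similarity scores.
    """
    # L2-normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                  # cosine similarity of each database shape
    order = np.argsort(-sims)      # sort indices by descending similarity
    return order[:k], sims[order[:k]]

# Toy example: 4 random "shape embeddings"; the query is a slightly
# perturbed copy of database item 2, so item 2 should rank first.
rng = np.random.default_rng(0)
db = rng.normal(size=(4, 8))
query = db[2] + 0.01 * rng.normal(size=8)
idx, scores = retrieve_top_k(query, db, k=2)
```

In this framing, the paper's comparison of supervised versus self-supervised fine-tuning amounts to changing how the encoder producing `db_embs` is trained, while the retrieval step itself stays the same.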
Pages: 10