Multi-View Vision Fusion Network: Can 2D Pre-Trained Model Boost 3D Point Cloud Data-Scarce Learning?

Cited by: 2
Authors
Peng, Haoyang [1 ]
Li, Baopu
Zhang, Bo [2 ]
Chen, Xin [3 ]
Chen, Tao [1 ]
Zhu, Hongyuan [4 ,5 ]
Affiliations
[1] Fudan Univ, Sch Informat Sci & Technol, Embedded Deep Learning & Visual Anal Grp, Shanghai 200433, Peoples R China
[2] Shanghai AI Lab, Shanghai 200232, Peoples R China
[3] Tencent PCG, Shanghai 200030, Peoples R China
[4] ASTAR, Inst Infocomm Res I2R, Singapore 138632, Singapore
[5] ASTAR, Ctr Frontier AI Res CFAR, Singapore 138632, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Point cloud compression; Three-dimensional displays; Solid modeling; Visualization; Task analysis; Feature extraction; Data models; Visual prompt learning; few-shot learning; point cloud classification;
DOI
10.1109/TCSVT.2023.3343495
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Point-cloud-based 3D deep models have wide applications in areas such as autonomous driving and household robotics. Inspired by recent prompt learning in natural language processing, this work proposes a novel Multi-view Vision Fusion Network (MvNet) for few-shot 3D point cloud classification. MvNet investigates the possibility of leveraging off-the-shelf 2D pre-trained models for few-shot classification, which alleviates the over-dependence of existing baseline models on large-scale annotated 3D point cloud data. Specifically, MvNet first encodes a 3D point cloud into multi-view image features for a number of different views. Then, a novel multi-view prompt fusion module is developed to effectively fuse information across views and bridge the gap between 3D point cloud data and 2D pre-trained models. A set of 2D image prompts is then derived to supply suitable prior knowledge to a large-scale pre-trained image model for few-shot 3D point cloud classification. Extensive experiments on the ModelNet, ScanObjectNN, and ShapeNet datasets demonstrate that MvNet achieves new state-of-the-art performance for few-shot 3D point cloud classification. The source code of this work is available at https://github.com/invictus717/MetaTransformer.
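To make the pipeline above concrete, the following is a minimal PyTorch sketch of the three stages the abstract describes: projecting a point cloud into multi-view images, encoding each view, and fusing the views into prompt tokens for a frozen 2D pre-trained backbone. Everything here is an illustrative assumption (the `project_to_views` stub, the attention-based fusion design, the stand-in frozen backbone, and all tensor shapes), not the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch of the pipeline described in the abstract (PyTorch).
# The projection stub, fusion design, stand-in backbone, and all shapes
# are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

def project_to_views(points: torch.Tensor, num_views: int = 6,
                     res: int = 224) -> torch.Tensor:
    """Stub: rasterize a point cloud (B, N, 3) into one depth image per view.

    A real implementation would rotate the cloud to each viewpoint and
    z-buffer the points; zeros stand in here to keep the sketch runnable.
    """
    return points.new_zeros(points.size(0), num_views, 1, res, res)

class MultiViewPromptFusion(nn.Module):
    """Pools per-view features into a small set of prompt tokens."""
    def __init__(self, dim: int = 384, num_prompts: int = 4, heads: int = 8):
        super().__init__()
        # Learnable queries that attend over the view features, so each
        # prompt token can aggregate evidence from every rendered view.
        self.queries = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (B, V, dim) -> prompts: (B, num_prompts, dim)
        q = self.queries.unsqueeze(0).expand(view_feats.size(0), -1, -1)
        prompts, _ = self.attn(q, view_feats, view_feats)
        return prompts

class MvNetSketch(nn.Module):
    def __init__(self, dim: int = 384, num_classes: int = 40):
        super().__init__()
        # Tiny per-view encoder mapping each depth image to one feature vector.
        self.view_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.fusion = MultiViewPromptFusion(dim)
        # Stand-in for the large frozen 2D pre-trained model; only the
        # fusion module and the head below are trained in the few-shot setting.
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.head = nn.Linear(dim, num_classes)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        views = project_to_views(points)                     # (B, V, 1, H, W)
        b, v = views.shape[:2]
        feats = self.view_encoder(views.flatten(0, 1)).view(b, v, -1)
        prompts = self.fusion(feats)                         # 3D -> 2D prompts
        tokens = self.backbone(torch.cat([prompts, feats], dim=1))
        return self.head(tokens.mean(dim=1))
```

As a quick shape check, `MvNetSketch()(torch.randn(2, 1024, 3))` returns `(2, 40)` logits, one score per class for each of the two point clouds; freezing the backbone while training only the small fusion module and head is what keeps the parameter count compatible with a few-shot budget.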
Pages: 5951-5962
Page count: 12