ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language

Cited by: 131

Authors:

Chen, Dave Zhenyu [1]

Chang, Angel X. [2]

Niessner, Matthias [1]

Affiliations:

[1] Tech Univ Munich, Munich, Germany

[2] Simon Fraser Univ, Burnaby, BC, Canada

Source:

COMPUTER VISION - ECCV 2020, PT XX | 2020 / Vol. 12365

Keywords:

DOI:

10.1007/978-3-030-58565-5_13

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, which learns a fused descriptor from 3D object proposals and encoded sentence embeddings. This fused descriptor correlates language expressions with geometric features, enabling regression of the 3D bounding box of a target object. We also introduce the ScanRefer dataset, containing 51,583 descriptions of 11,046 objects from 800 ScanNet [8] scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D (Code: https://daveredrum.github.io/ScanRefer/).
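The fusion idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, feature sizes, the random weights, and the small ReLU scoring layer are all assumptions made for illustration, and the sketch selects the best-scoring proposal box rather than regressing a box.

```python
import numpy as np

def fuse_and_score(proposal_feats, sentence_emb, w1, b1, w2):
    """Score each 3D object proposal against one sentence (hypothetical helper).

    proposal_feats: (K, D) array, one feature row per object proposal
    sentence_emb:   (E,) array, the encoded description
    w1, b1, w2:     weights of an assumed two-layer scorer
    """
    k = proposal_feats.shape[0]
    # Fuse: attach the same sentence embedding to every proposal feature.
    tiled = np.tile(sentence_emb, (k, 1))                      # (K, E)
    fused = np.concatenate([proposal_feats, tiled], axis=1)    # (K, D + E)
    # A nonlinearity lets geometric and language features interact.
    hidden = np.maximum(fused @ w1 + b1, 0.0)                  # (K, H)
    return hidden @ w2                                         # (K,)

def localize(proposal_boxes, scores):
    """Return the box of the highest-scoring proposal."""
    return proposal_boxes[int(np.argmax(scores))]

# Toy inputs with assumed sizes; real features would come from trained encoders.
rng = np.random.default_rng(0)
K, D, E, H = 8, 16, 12, 32
feats = rng.normal(size=(K, D))
boxes = rng.normal(size=(K, 6))   # (cx, cy, cz, dx, dy, dz) per proposal
sent = rng.normal(size=E)
w1 = rng.normal(size=(D + E, H))
b1 = np.zeros(H)
w2 = rng.normal(size=H)

scores = fuse_and_score(feats, sent, w1, b1, w2)
box = localize(boxes, scores)
```

In a trained system the weights would be learned so that the proposal matching the description receives the highest score.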

Citation

Pages: 202-221

Page count: 20

References: 73 in total

[1] Acharya M, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P1955

[2] Shapeglot: Learning Language for Shape Differentiation. Achlioptas, Panos; Fan, Judy; Hawkins, Robert; Goodman, Noah; Guibas, Leonidas. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 8937-8946

[3] Matterport3D: Learning from RGB-D Data in Indoor Environments. Chang, Angel; Dai, Angela; Funkhouser, Thomas; Halber, Maciej; Niessner, Matthias; Savva, Manolis; Song, Shuran; Zeng, Andy; Zhang, Yinda. PROCEEDINGS 2017 INTERNATIONAL CONFERENCE ON 3D VISION (3DV), 2017: 667-676

[4] See-Through-Text Grouping for Referring Image Segmentation. Chen, Ding-Jie; Jia, Songhao; Lo, Yi-Chen; Chen, Hwann-Tzong; Liu, Tyng-Luh. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 7453-7462

[5] Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings. Chen, Kevin; Choy, Christopher B.; Savva, Manolis; Chang, Angel X.; Funkhouser, Thomas; Savarese, Silvio. COMPUTER VISION - ACCV 2018, PT III, 2019, 11363: 100-116

[6] Chung JY, 2014, arXiv, arXiv:1412.3555

[7] 3DMV: Joint 3D-Multi-view Prediction for 3D Semantic Scene Segmentation. Dai, Angela; Niessner, Matthias. COMPUTER VISION - ECCV 2018, PT X, 2018, 11214: 458-474

[8] Shape Completion using 3D-Encoder-Predictor CNNs and Shape Synthesis. Dai, Angela; Qi, Charles Ruizhongtai; Niessner, Matthias. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 6545-6554

[9] ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. Dai, Angela; Chang, Angel X.; Savva, Manolis; Halber, Maciej; Funkhouser, Thomas; Niessner, Matthias. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 2432-2443

[10] Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment. Datta, Samyak; Sikka, Karan; Roy, Anirban; Ahuja, Karuna; Parikh, Devi; Divakaran, Ajay. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 2601-2610
