ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language

Cited by: 131
Authors
Chen, Dave Zhenyu [1]
Chang, Angel X. [2]
Niessner, Matthias [1]
Affiliations
[1] Tech Univ Munich, Munich, Germany
[2] Simon Fraser Univ, Burnaby, BC, Canada
Source
COMPUTER VISION - ECCV 2020, PT XX | 2020 / Vol. 12365
Keywords
DOI
10.1007/978-3-030-58565-5_13
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104; 0812; 0835; 1405
Abstract
We introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, learning a fused descriptor from 3D object proposals and encoded sentence embeddings. This fused descriptor correlates language expressions with geometric features, enabling regression of the 3D bounding box of a target object. We also introduce the ScanRefer dataset, containing 51,583 descriptions of 11,046 objects from 800 ScanNet [8] scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D (Code: https://daveredrum.github.io/ScanRefer/).
Pages: 202-221
Number of pages: 20
Related papers
73 entries in total
[1]  
Acharya M, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P1955
[2]   Shapeglot: Learning Language for Shape Differentiation [J].
Achlioptas, Panos ;
Fan, Judy ;
Hawkins, Robert ;
Goodman, Noah ;
Guibas, Leonidas .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :8937-8946
[3]   Matterport3D: Learning from RGB-D Data in Indoor Environments [J].
Chang, Angel ;
Dai, Angela ;
Funkhouser, Thomas ;
Halber, Maciej ;
Niessner, Matthias ;
Savva, Manolis ;
Song, Shuran ;
Zeng, Andy ;
Zhang, Yinda .
PROCEEDINGS 2017 INTERNATIONAL CONFERENCE ON 3D VISION (3DV), 2017, :667-676
[4]   See-Through-Text Grouping for Referring Image Segmentation [J].
Chen, Ding-Jie ;
Jia, Songhao ;
Lo, Yi-Chen ;
Chen, Hwann-Tzong ;
Liu, Tyng-Luh .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :7453-7462
[5]   Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings [J].
Chen, Kevin ;
Choy, Christopher B. ;
Savva, Manolis ;
Chang, Angel X. ;
Funkhouser, Thomas ;
Savarese, Silvio .
COMPUTER VISION - ACCV 2018, PT III, 2019, 11363 :100-116
[6]  
Chung JY, 2014, Arxiv, DOI arXiv:1412.3555
[7]   3DMV: Joint 3D-Multi-view Prediction for 3D Semantic Scene Segmentation [J].
Dai, Angela ;
Niessner, Matthias .
COMPUTER VISION - ECCV 2018, PT X, 2018, 11214 :458-474
[8]   Shape Completion using 3D-Encoder-Predictor CNNs and Shape Synthesis [J].
Dai, Angela ;
Qi, Charles Ruizhongtai ;
Niessner, Matthias .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :6545-6554
[9]   ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes [J].
Dai, Angela ;
Chang, Angel X. ;
Savva, Manolis ;
Halber, Maciej ;
Funkhouser, Thomas ;
Niessner, Matthias .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :2432-2443
[10]   Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment [J].
Datta, Samyak ;
Sikka, Karan ;
Roy, Anirban ;
Ahuja, Karuna ;
Parikh, Devi ;
Divakaran, Ajay .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :2601-2610