TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments

Cited by: 172

Authors:

Chen, Howard ^{[1,4]}

Suhr, Alane ^{[2,3]}

Misra, Dipendra ^{[2,3]}

Snavely, Noah ^{[2,3]}

Artzi, Yoav ^{[2,3]}

Affiliations:

[1] ASAPP Inc, New York, NY 10007 USA

[2] Cornell Univ, Dept Comp Sci, New York, NY 10021 USA

[3] Cornell Univ, Cornell Tech, New York, NY 10021 USA

[4] Cornell Univ, New York, NY 10021 USA

Source:

2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019) | 2019

Keywords:

DOI:

10.1109/CVPR.2019.01282

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a Street View environment to a goal position, and then guess a location in its observed environment described in natural language to find a hidden object. The data contains 9326 examples of English instructions and spatial descriptions paired with demonstrations. We perform qualitative linguistic analysis, and show that the data displays a rich use of spatial reasoning. Empirical analysis shows the data presents an open challenge to existing methods.

Citation

Pages: 12530-12539

Page count: 10