Habitat: A Platform for Embodied AI Research

Cited by: 789
Authors
Savva, Manolis [1 ,4 ]
Kadian, Abhishek [1 ]
Maksymets, Oleksandr [1 ]
Zhao, Yili [1 ]
Wijmans, Erik [1 ,2 ,3 ]
Jain, Bhavana [1 ]
Straub, Julian [2 ]
Liu, Jia [1 ]
Koltun, Vladlen [5 ]
Malik, Jitendra [1 ,6 ]
Parikh, Devi [1 ,3 ]
Batra, Dhruv [1 ,3 ]
Affiliations
[1] Facebook AI Res, Menlo Pk, CA 94025 USA
[2] Facebook Reality Labs, Pittsburgh, PA USA
[3] Georgia Inst Technol, Atlanta, GA 30332 USA
[4] Simon Fraser Univ, Burnaby, BC, Canada
[5] Intel Labs, Santa Clara, CA USA
[6] Univ Calif Berkeley, Berkeley, CA USA
Source
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019
Keywords
DOI
10.1109/ICCV.2019.00943
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present Habitat, a platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation. Specifically, Habitat consists of: (i) Habitat-Sim: a flexible, high-performance 3D simulator with configurable agents, sensors, and generic 3D dataset handling. Habitat-Sim is fast - when rendering a scene from Matterport3D, it achieves several thousand frames per second (fps) running single-threaded, and can reach over 10,000 fps multi-process on a single GPU. (ii) Habitat-API: a modular high-level library for end-to-end development of embodied AI algorithms - defining tasks (e.g. navigation, instruction following, question answering), configuring, training, and benchmarking embodied agents. These large-scale engineering contributions enable us to answer scientific questions requiring experiments that were till now impracticable or 'merely' impractical. Specifically, in the context of point-goal navigation: (1) we revisit the comparison between learning and SLAM approaches from two recent works [19, 16] and find evidence for the opposite conclusion - that learning outperforms SLAM if scaled to an order of magnitude more experience than previous investigations, and (2) we conduct the first cross-dataset generalization experiments {train, test} × {Matterport3D, Gibson} for multiple sensors {blind, RGB, RGBD, D} and find that only agents with depth (D) sensors generalize across datasets. We hope that our open-source platform and these findings will advance research in embodied AI.
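To make the Habitat-API workflow described in the abstract concrete, the following is a minimal Python sketch modeled on the project's quick-start example: it loads a point-goal navigation task configuration, creates an environment backed by Habitat-Sim, and steps a random agent until the episode ends. The config path configs/tasks/pointnav.yaml and the presence of locally downloaded scene data (e.g. Matterport3D or Gibson) are assumptions about a standard habitat-api installation, not details given in this record.

import habitat

# Load a point-goal navigation task configuration (assumes the pointnav.yaml
# config shipped with habitat-api and locally downloaded scene data).
config = habitat.get_config("configs/tasks/pointnav.yaml")

# Habitat-API wraps the Habitat-Sim renderer in a task definition:
# episodes, sensors (e.g. RGB, depth), actions, and evaluation metrics.
env = habitat.Env(config=config)

# Reset returns a dict of sensor readings, e.g. "rgb", "depth", "pointgoal".
observations = env.reset()

# Step a random agent until the episode terminates.
while not env.episode_over:
    observations = env.step(env.action_space.sample())

env.close()

A learned agent would replace env.action_space.sample() with a policy acting on the observation dict; task metrics (e.g. SPL for point-goal navigation) are computed by the same library for benchmarking.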
Pages: 9338-9346
Number of pages: 9
References
28 entries in total
[1]  
Ammirato, Phil, 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA), P1378, DOI 10.1109/ICRA.2017.7989164
[2]   Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments [J].
Anderson, Peter ;
Wu, Qi ;
Teney, Damien ;
Bruce, Jake ;
Johnson, Mark ;
Sunderhauf, Niko ;
Reid, Ian ;
Gould, Stephen ;
van den Hengel, Anton .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :3674-3683
[3]  
Anderson, Peter, 2018, arXiv:1807.06757
[4]  
[Anonymous], 2015, TUK CENT WORKSH
[5]   VQA: Visual Question Answering [J].
Antol, Stanislaw ;
Agrawal, Aishwarya ;
Lu, Jiasen ;
Mitchell, Margaret ;
Batra, Dhruv ;
Zitnick, C. Lawrence ;
Parikh, Devi .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :2425-2433
[6]   3D Semantic Parsing of Large-Scale Indoor Spaces [J].
Armeni, Iro ;
Sener, Ozan ;
Zamir, Amir R. ;
Jiang, Helen ;
Brilakis, Ioannis ;
Fischer, Martin ;
Savarese, Silvio .
2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, :1534-1543
[7]  
Bewley, A., 2019, IEEE International Conference on Robotics and Automation (ICRA), P4818, DOI 10.1109/ICRA.2019.8793668
[8]  
Brodeur, S., 2017, arXiv:1711.11017
[9]  
Chang, A., 2017, International Conference on 3D Vision (3DV)
[10]  
Das, Abhishek, 2018, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)