Learning Navigational Visual Representations with Semantic Map Supervision

Cited by: 10

Authors:

Hong, Yicong ^{[1,2]}

Zhou, Yang ^{[1]}

Zhang, Ruiyi ^{[1]}

Dernoncourt, Franck ^{[1]}

Bui, Trung ^{[1]}

Gould, Stephen ^{[2]}

Tan, Hao ^{[1]}

Affiliations:

[1] Adobe Res, San Francisco, CA 94107 USA

[2] Australian Natl Univ, Canberra, ACT, Australia

Source:

2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV | 2023

Keywords:

LANGUAGE;

DOI:

10.1109/ICCV51070.2023.00284

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot. However, most existing works only employ visual backbones pre-trained either with independent images for classification or with self-supervised learning methods to adapt to the indoor navigation domain, neglecting the spatial relationships that are essential to the learning of navigation. Inspired by the behavior that humans naturally build semantically and spatially meaningful cognitive maps in their brains during navigation, in this paper, we propose a novel navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps (Ego2-Map). We apply the visual transformer as the backbone encoder and train the model with data collected from the large-scale Habitat-Matterport3D environments. Ego2-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation. Experiments show that agents using our learned representations on object-goal navigation outperform recent visual pre-training methods. Moreover, our representations significantly improve vision-and-language navigation in continuous environments for both high-level and low-level action spaces, achieving new state-of-the-art results of 47% SR and 41% SPL on the test server.

Citation

Pages: 3032-3044

Page count: 13
