Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation

Cited by: 42
Authors
Hong, Yicong [1]
Wang, Zun [1]
Wu, Qi [2]
Gould, Stephen [1]
Affiliations
[1] Australian National University, Canberra, ACT, Australia
[2] University of Adelaide, Adelaide, SA, Australia
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI: 10.1109/CVPR52688.2022.01500
CLC classification: TP18 [Theory of Artificial Intelligence]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Most existing works in vision-and-language navigation (VLN) focus on either discrete or continuous environments, training agents that cannot generalize across the two. Although learning to navigate in continuous spaces is closer to the real world, training such an agent is significantly more difficult than training an agent in discrete spaces, and recent advances in discrete VLN are hard to translate to continuous VLN due to the domain gap. The fundamental difference between the two setups is that discrete navigation assumes prior knowledge of the connectivity graph of the environment, so that the agent can effectively reduce the problem of navigation with low-level controls to jumping from node to node with high-level actions by grounding to an image of a navigable direction. To bridge the discrete-to-continuous gap, we propose a predictor that generates a set of candidate waypoints during navigation, so that agents designed with high-level actions can be transferred to and trained in continuous environments. We refine the connectivity graph of Matterport3D to fit the continuous Habitat-Matterport3D, and train the waypoint predictor with the refined graphs to produce accessible waypoints at each time step. Moreover, we demonstrate that the predicted waypoints can be augmented during training to diversify the views and paths, and therefore enhance the agent's generalization ability. Through extensive experiments we show that agents navigating in continuous environments with predicted waypoints perform significantly better than agents using low-level actions, which reduces the absolute discrete-to-continuous gap by 11.76% Success weighted by Path Length (SPL) for the Cross-Modal Matching Agent and 18.24% SPL for VLN↻BERT. Our agents, trained with a simple imitation learning objective, outperform previous methods by a large margin, achieving new state-of-the-art results on the test environments of the R2R-CE and RxR-CE datasets.
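To make the abstract concrete, below is a minimal Python sketch of the two ideas it relies on. The spl function implements the standard Success weighted by Path Length metric (Anderson et al., 2018) used to quantify the discrete-to-continuous gap; the navigate loop is a schematic illustration of waypoint-based high-level navigation as the abstract describes it. The env, predictor, and agent interfaces are hypothetical assumptions of this sketch, not the authors' code or API.

# Minimal sketch (not the authors' code). Only the SPL formula is
# standard; the navigation interfaces below are illustrative.
from typing import List, Optional

def spl(successes: List[int], shortest: List[float], taken: List[float]) -> float:
    # Success weighted by Path Length (Anderson et al., 2018):
    # SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    # where S_i is binary success, l_i the shortest-path length to the
    # goal, and p_i the length of the path the agent actually travelled.
    return sum(s * l / max(p, l)
               for s, l, p in zip(successes, shortest, taken)) / len(successes)

def navigate(env, predictor, agent, instruction: str, max_steps: int = 15):
    # Hypothetical high-level loop: at each step the waypoint predictor
    # proposes nearby accessible waypoints from the current observation;
    # the agent grounds the instruction to one of them (or stops), and a
    # low-level controller moves the agent to the chosen waypoint.
    obs = env.reset()
    for _ in range(max_steps):
        candidates = predictor(obs)                  # candidate waypoints
        choice: Optional[int] = agent.choose(instruction, obs, candidates)
        if choice is None:                           # agent predicts STOP
            break
        obs = env.step_to(candidates[choice])        # low-level control
    return env.current_position()

# Example SPL computation: three episodes, two successful.
print(spl([1, 0, 1], [10.0, 8.0, 12.0], [12.5, 9.0, 12.0]))  # 0.6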
Pages: 15418-15428
Number of pages: 11