ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

Cited by: 2
Authors
An, Dong [1,2]
Wang, Hanqing [3]
Wang, Wenguan [4]
Wang, Zun [5]
Huang, Yan [1,2]
He, Keji [1,2]
Wang, Liang [1,2]
Affiliations
[1] University of Chinese Academy of Sciences, Center for Research on Intelligent Perception and Computing (CRIPAC), School of Future Technology, National Laboratory of Pattern Recognition (NLPR), Beijing 101408, China
[2] University of Chinese Academy of Sciences, School of Artificial Intelligence, Beijing 101408, China
[3] Beijing Institute of Technology, Beijing 100811, China
[4] Zhejiang University, Hangzhou 310027, Zhejiang, China
[5] Australian National University, Canberra, ACT 2601, Australia
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
Navigation; Task analysis; Planning; Layout; Transformers; Semantics; Measurement; Vision-language navigation; topological map; obstacle avoidance; SLAM;
DOI
10.1109/TPAMI.2024.3386695
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Vision-language navigation is a task that requires an agent to follow instructions to navigate in an environment. It is becoming increasingly important in embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we address a more practical yet challenging counterpart setting: vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without requiring prior environmental experience. This allows the agent to decompose the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav uses a transformer-based cross-modal planner to generate navigation plans from topological maps and instructions. The plan is then executed by an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method: ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on the R2R-CE and RxR-CE datasets, respectively.
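To make the online mapping step concrete, below is a minimal Python sketch of how predicted waypoints might be self-organized into a topological map: a new waypoint is merged into the nearest existing node when it falls within a distance threshold, and otherwise becomes a new node linked to the agent's current node. Everything here (the TopoMap class, the add_waypoints interface, the 0.5 m merge radius) is an illustrative assumption, not the authors' released implementation.

import math

class TopoMap:
    # Minimal topological map built online from predicted waypoints.
    # Hypothetical sketch; ETPNav's actual code may differ substantially.
    def __init__(self, merge_radius=0.5):  # assumed 0.5 m merge threshold
        self.nodes = []     # node positions as (x, z) tuples
        self.edges = set()  # undirected edges as sorted (i, j) index pairs
        self.merge_radius = merge_radius

    def _nearest(self, pos):
        # Linear scan for the closest existing node; fine at map scale.
        best, best_d = None, float("inf")
        for i, p in enumerate(self.nodes):
            d = math.dist(p, pos)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def add_waypoints(self, current_node, waypoints):
        # Self-organize newly predicted waypoints into the graph.
        for wp in waypoints:
            idx, d = self._nearest(wp)
            if idx is not None and d < self.merge_radius:
                # Merge: nudge the stored node toward the new observation.
                px, pz = self.nodes[idx]
                self.nodes[idx] = ((px + wp[0]) / 2, (pz + wp[1]) / 2)
                neighbor = idx
            else:
                # Unseen region: add a fresh node.
                self.nodes.append(tuple(wp))
                neighbor = len(self.nodes) - 1
            if current_node is not None and neighbor != current_node:
                self.edges.add(tuple(sorted((current_node, neighbor))))

Merging nearby waypoints keeps the graph compact as the agent revisits a region, which is what lets a cross-modal planner reason over long-range routes rather than raw frame-by-frame observations.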
Pages: 5130-5145
Page count: 16