Memory-Adaptive Vision-and-Language Navigation

Cited: 3
Authors
He, Keji [1 ,2 ]
Jing, Ya [3 ]
Huang, Yan [1 ,2 ]
Lu, Zhihe [4 ]
An, Dong [1 ,5 ]
Wang, Liang [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Ctr Res Intelligent Percept & Comp, State Key Lab Multimodal Artificial Intelligence S, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] ByteDance AI Lab, Beijing, Peoples R China
[4] Natl Univ Singapore, Singapore, Singapore
[5] Univ Chinese Acad Sci, Sch Future Technol, Beijing, Peoples R China
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Vision-and-Language Navigation; Memory bank; History noises; Memory-Adaptive Model;
DOI
10.1016/j.patcog.2024.110511
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate 3D environments by following given instructions, where history is critical for decision-making in the dynamic navigation process. In particular, a memory bank storing histories is widely used in existing methods and combined with multimodal representations of the current scene for better decision-making. However, by weighting each history with a simple scalar, those methods cannot isolate the informative cues that co-exist with detrimental contents in each history, thereby inevitably introducing noise into decision-making. To that end, we propose a novel Memory-Adaptive Model (MAM) that dynamically restrains the detrimental contents in histories so that only contents beneficial to navigation are retained. Specifically, two key modules, the Visual and Textual Adaptive Modules, are designed to restrain history noise based on scene-related vision and text, respectively. A Reliability Estimator Module is further introduced to refine the above adaptation operations. Our experiments on the widely used RxR and R2R datasets show that MAM outperforms its baseline method by 4.0%/2.5% and 2%/1% on the validation-unseen/test splits, respectively, w.r.t. the SR metric.
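The abstract contrasts baseline methods, which scale each stored history by a single scalar weight, with element-wise adaptation that can suppress noisy dimensions within a history while keeping informative ones. A minimal sketch of that distinction, assuming a toy memory bank of fixed-size history vectors (all function names, shapes, and the sigmoid-gate formulation are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

def scalar_weighted_readout(memory, scores):
    """Baseline-style readout: each history vector gets ONE scalar weight,
    so useful and detrimental dimensions inside a history are amplified
    or suppressed together."""
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over histories
    return (weights[:, None] * memory).sum(axis=0)

def gated_readout(memory, gate_logits):
    """Adaptive-style readout (illustrative): an element-wise sigmoid gate
    per history can pass informative dimensions and restrain noisy ones
    independently, before pooling across the memory bank."""
    gates = 1.0 / (1.0 + np.exp(-gate_logits))    # each entry in (0, 1)
    return (gates * memory).mean(axis=0)

rng = np.random.default_rng(0)
memory = rng.standard_normal((5, 8))              # 5 histories, 8-dim each
scalar_out = scalar_weighted_readout(memory, rng.standard_normal(5))
gated_out = gated_readout(memory, rng.standard_normal((5, 8)))
print(scalar_out.shape, gated_out.shape)          # both are 8-dim vectors
```

The scalar path has 5 degrees of freedom (one per history) while the gated path has 5×8, which is what lets an element-wise scheme keep part of a history and discard the rest.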
Pages: 13