Boosting Efficient Reinforcement Learning for Vision-and-Language Navigation With Open-Sourced LLM

Cited by: 1
|
Authors
Wang, Jiawei [1 ]
Wang, Teng [2 ]
Cai, Wenzhe [2 ]
Xu, Lele [2 ]
Sun, Changyin [2 ,3 ]
Affiliations
[1] Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
[2] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China
[3] Anhui Univ, Sch Artificial Intelligence, Hefei 230601, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2025, Vol. 10, No. 1
Funding
National Natural Science Foundation of China;
Keywords
Navigation; Trajectory; Visualization; Reinforcement learning; Feature extraction; Cognition; Robots; Transformers; Large language models; Vision-and-language navigation (VLN); reinforcement learning (RL); attention; discriminator;
DOI
10.1109/LRA.2024.3511402
CLC number
TP24 [Robotics];
Subject classification codes
080202 ; 1405 ;
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate photo-realistic environments by following language instructions. Existing methods typically train agents with imitation learning. However, approaches based on recurrent neural networks generalize poorly, while transformer-based methods are too large for practical deployment. In contrast, reinforcement learning (RL) agents can overcome dataset limitations and learn navigation policies that adapt to environment changes. Yet without expert trajectories for supervision, agents struggle to learn effective long-term navigation policies from sparse environment rewards. Instruction decomposition enables agents to learn value estimation faster, making them more efficient at learning VLN tasks. We propose Decomposing Instructions with Large Language Models for Vision-and-Language Navigation (DILLM-VLN), which decomposes complex navigation instructions into simple, interpretable sub-instructions using a lightweight, open-sourced LLM and trains RL agents to complete these sub-instructions sequentially. Based on these interpretable sub-instructions, we introduce a cascaded multi-scale attention (CMA) module and a novel multi-modal fusion discriminator (MFD). CMA integrates instruction features at different scales to provide precise textual guidance. MFD combines scene, object, and action information to comprehensively assess the completion of sub-instructions. Experimental results show that DILLM-VLN significantly improves baseline performance, demonstrating its potential for practical applications.
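As a minimal illustration of the decomposition step described in the abstract (not the authors' implementation), the sketch below shows a hypothetical prompt template for a lightweight open-source LLM and a parser that turns the model's numbered reply into an ordered list of sub-instructions for the RL agent; all function names and the prompt wording are assumptions.

```python
def build_decomposition_prompt(instruction: str) -> str:
    # Hypothetical prompt asking an LLM to split a VLN instruction
    # into short, atomic, numbered sub-instructions.
    return (
        "Decompose the following navigation instruction into short, "
        "ordered sub-instructions, one per line, numbered 1., 2., ...\n"
        f"Instruction: {instruction}\n"
    )

def parse_sub_instructions(llm_reply: str) -> list[str]:
    # Parse numbered lines like "1. Walk down the hallway" into a list,
    # dropping the leading "N." marker on each line.
    subs = []
    for line in llm_reply.splitlines():
        line = line.strip()
        if line and line[0].isdigit() and "." in line:
            subs.append(line.split(".", 1)[1].strip())
    return subs

# Example reply a decomposition LLM might produce for one instruction.
reply = "1. Exit the bedroom.\n2. Turn left at the hallway.\n3. Stop at the sofa."
print(parse_sub_instructions(reply))
# → ['Exit the bedroom.', 'Turn left at the hallway.', 'Stop at the sofa.']
```

In the paper's pipeline these sub-instructions would then be consumed one at a time by the RL agent, with the discriminator deciding when each is complete.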
Pages: 612-619
Page count: 8