Boosting Efficient Reinforcement Learning for Vision-and-Language Navigation With Open-Sourced LLM

Cited by: 1
Authors
Wang, Jiawei [1 ]
Wang, Teng [2 ]
Cai, Wenzhe [2 ]
Xu, Lele [2 ]
Sun, Changyin [2 ,3 ]
Affiliations
[1] Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
[2] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China
[3] Anhui Univ, Sch Artificial Intelligence, Hefei 230601, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2025, Vol. 10, No. 1
Funding
National Natural Science Foundation of China
Keywords
Navigation; Trajectory; Visualization; Reinforcement learning; Feature extraction; Cognition; Robots; Transformers; Large language models; Vision-and-language navigation (VLN); large language models; reinforcement learning (RL); attention; discriminator
DOI
10.1109/LRA.2024.3511402
CLC Classification Number
TP24 [Robotics]
Subject Classification Codes
080202; 1405
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate photo-realistic environments according to language instructions. Existing methods typically train agents with imitation learning. However, approaches based on recurrent neural networks generalize poorly, while transformer-based methods are too large for practical deployment. In contrast, reinforcement learning (RL) agents can overcome dataset limitations and learn navigation policies that adapt to environment changes. Yet without expert trajectories for supervision, agents struggle to learn effective long-term navigation policies from sparse environment rewards. Instruction decomposition lets agents learn value estimation faster, making them more efficient at learning VLN tasks. We propose Decomposing Instructions with Large Language Models for Vision-and-Language Navigation (DILLM-VLN), which decomposes complex navigation instructions into simple, interpretable sub-instructions using a lightweight, open-sourced LLM and trains RL agents to complete these sub-instructions sequentially. Building on these interpretable sub-instructions, we introduce a cascaded multi-scale attention (CMA) module and a novel multi-modal fusion discriminator (MFD). CMA integrates instruction features at different scales to provide precise textual guidance. MFD combines scene, object, and action information to comprehensively assess the completion of sub-instructions. Experimental results show that DILLM-VLN significantly improves baseline performance, demonstrating its potential for practical applications.
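As a hedged illustration of the instruction-decomposition step described in the abstract, the minimal Python sketch below prompts a small open-sourced LLM (via the Hugging Face transformers library) to split one navigation instruction into ordered sub-instructions. The model name, prompt wording, and the decompose_instruction helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of LLM-based instruction decomposition (assumptions: model
# choice, prompt wording, and output parsing are illustrative, not the
# DILLM-VLN pipeline itself).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # assumed lightweight open-sourced LLM

PROMPT_TEMPLATE = (
    "Split the following navigation instruction into short, ordered "
    "sub-instructions, one per line:\n{instruction}\nSub-instructions:\n"
)

def decompose_instruction(instruction: str) -> list[str]:
    """Prompt the LLM once and parse its output into a list of sub-instructions."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    inputs = tokenizer(PROMPT_TEMPLATE.format(instruction=instruction),
                       return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Drop the prompt tokens and keep only the newly generated continuation.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return [line.strip("-* ").strip() for line in text.splitlines() if line.strip()]

if __name__ == "__main__":
    sub_instructions = decompose_instruction(
        "Walk past the couch, turn left at the kitchen, and stop next to the fridge."
    )
    # In the paper's setting, an RL agent would attempt these in order.
    print(sub_instructions)
```

In the full method, an RL agent would then attempt each sub-instruction in turn, with the multi-modal fusion discriminator judging when one is complete; that machinery is beyond this sketch.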
Pages: 612 - 619
Page count: 8
Related Papers
42 records in total
  • [31] Towards Efficient Mapless Navigation Using Deep Reinforcement Learning with Parameter Space Noise
    Liu, Xiaoyun
    Zhou, Qingrui
    Wang, Hui
    Yang, Ying
    PROCEEDINGS OF THE 38TH CHINESE CONTROL CONFERENCE (CCC), 2019, : 8833 - 8837
  • [32] Vision-based control in the open racing car simulator with deep and reinforcement learning
Zhu, Y.
Zhao, D.
    Journal of Ambient Intelligence and Humanized Computing, 2023, 14 (12) : 15673 - 15685
  • [33] Using reinforcement learning with external rewards for open-domain natural language generation
    Srinivasan, Vidhushini
    Santhanam, Sashank
    Shaikh, Samira
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2021, 56 (01) : 189 - 206
  • [34] Adaptive Deep Reinforcement Learning for Efficient 3D Navigation of Autonomous Underwater Vehicles
    Politi, Elena
    Stefanidou, Artemis
    Chronis, Christos
    Dimitrakopoulos, George
    Varlamis, Iraklis
    IEEE ACCESS, 2024, 12 : 178209 - 178221
  • [36] An Efficient Reinforcement Learning-Based Cooperative Navigation Algorithm for Multiple UAVs in Complex Environments
    Zhang, Lijuan
    Yi, Weiguo
    Lin, Hang
    Peng, Jiabin
    Gao, Pan
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2024, 20 (10) : 12396 - 12406
  • [37] Reinforced Imitation: Sample Efficient Deep Reinforcement Learning for Mapless Navigation by Leveraging Prior Demonstrations
    Pfeiffer, Mark
    Shukla, Samarth
    Turchetta, Matteo
    Cadena, Cesar
    Krause, Andreas
    Siegwart, Roland
    Nieto, Juan
IEEE ROBOTICS AND AUTOMATION LETTERS, 2018, 3 (04) : 4423 - 4430
  • [38] Distributed Energy-Efficient Multi-UAV Navigation for Long-Term Communication Coverage by Deep Reinforcement Learning
    Liu, Chi Harold
    Ma, Xiaoxin
    Gao, Xudong
    Tang, Jian
    IEEE TRANSACTIONS ON MOBILE COMPUTING, 2020, 19 (06) : 1274 - 1285
  • [39] An Advisor-Based Architecture for a Sample-Efficient Training of Autonomous Navigation Agents with Reinforcement Learning
    Wijesinghe, Rukshan Darshana
    Tissera, Dumindu
    Vithanage, Mihira Kasun
    Xavier, Alex
    Fernando, Subha
    Samarawickrama, Jayathu
    ROBOTICS, 2023, 12 (05)
  • [40] A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model
    Hu, Panwen
    Xiao, Nan
    Li, Feifei
    Chen, Yongquan
    Huang, Rui
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 6441 - 6450