Sample Trajectory Selection Method Based on Large Language Model in Reinforcement Learning

Cited by: 1
Authors
Lai, Jinbang [1 ]
Zang, Zhaoxiang [1 ]
Affiliations
[1] China Three Gorges Univ, Hubei Key Lab Intelligent Vis Based Monitoring Hyd, Yichang 443002, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Reinforcement learning; large language models; trajectory selection; sampling efficiency; GO;
DOI
10.1109/ACCESS.2024.3395457
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
This paper introduces a trajectory selection method based on large-scale pre-trained language models, aiming to improve sample and training efficiency in reinforcement learning. Guided by a carefully designed prompt, the large language model draws on its prior knowledge to understand and assess the quality of trajectories produced through agent-environment interaction, so that more informative trajectories can be selected for the current agent's learning. Unlike work that improves reinforcement-learning training efficiency indirectly by having large language models generate actions or decisions, our method uses these models to choose high-quality trajectories, improving sample efficiency more directly. The approach was evaluated on multiple benchmark tasks from OpenAI Gym and RLCard. The results show a significant reduction in the number of environment interactions and a 37% increase in average reward compared with the original method.
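As a rough illustration of the idea described in the abstract, the following is a minimal sketch assuming a generic chat-style language-model interface: trajectories are serialized into a prompt, the model is asked for an informativeness score, and only the top-rated trajectories are passed on to the agent's update. The prompt wording, the helper names query_llm, score_trajectory, and select_trajectories, and the 0-10 scoring scale are illustrative assumptions, not the paper's published implementation.

# Minimal sketch of LLM-based trajectory selection (illustrative only;
# names, prompt wording, and scoring scale are assumptions).
import heapq
from typing import List, Tuple

Trajectory = List[Tuple[list, int, float]]  # (state, action, reward) per step

PROMPT_TEMPLATE = (
    "You are evaluating reinforcement-learning trajectories.\n"
    "Task description: {task}\n"
    "Trajectory (state, action, reward per step):\n{steps}\n"
    "Rate how informative this trajectory is for learning, "
    "as a single integer from 0 (useless) to 10 (highly informative)."
)

def format_trajectory(traj: Trajectory) -> str:
    # Serialize a trajectory into plain text the language model can read.
    return "\n".join(f"s={s}, a={a}, r={r:.2f}" for s, a, r in traj)

def query_llm(prompt: str) -> str:
    # Placeholder for a call to a large pre-trained language model;
    # replace with your provider's chat/completion API.
    raise NotImplementedError

def score_trajectory(traj: Trajectory, task: str) -> float:
    # Ask the LLM for a 0-10 informativeness score; fall back to the
    # episodic return if no model is wired in or the reply cannot be parsed.
    prompt = PROMPT_TEMPLATE.format(task=task, steps=format_trajectory(traj))
    try:
        reply = query_llm(prompt)
        return float(reply.strip().split()[0])
    except (NotImplementedError, ValueError, IndexError):
        return sum(r for _, _, r in traj)

def select_trajectories(trajs: List[Trajectory], task: str, k: int) -> List[Trajectory]:
    # Keep the k trajectories rated most informative; only these are
    # handed to the agent's update step.
    return heapq.nlargest(k, trajs, key=lambda t: score_trajectory(t, task))

Falling back to the episodic return when the model's reply cannot be parsed keeps the sketch runnable even without a configured LLM endpoint; in the paper's setting the score would come from the prompted language model.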
Pages: 61877-61885
Number of pages: 9