TEXPLORE: real-time sample-efficient reinforcement learning for robots

Cited by: 65
Authors
Hester, Todd [1]
Stone, Peter [1]
Affiliation
[1] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA
Funding
U.S. National Science Foundation;
Keywords
Reinforcement learning; Robotics; MDP; Real-time; SYSTEMS;
DOI
10.1007/s10994-012-5322-7
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The use of robots in society could be expanded by using reinforcement learning (RL) to allow robots to learn and adapt to new situations online. RL is a paradigm for learning sequential decision-making tasks, usually formulated as a Markov Decision Process (MDP). For an RL algorithm to be practical for robotic control tasks, it must learn from very few samples while continually taking actions in real time. In addition, the algorithm must learn efficiently in the face of noise, sensor/actuator delays, and continuous state features. In this article, we present TEXPLORE, the first algorithm to address all of these challenges together. TEXPLORE is a model-based RL method that learns a random forest model of the domain, generalizing the dynamics to unseen states. The agent explores states that are promising for the final policy, while ignoring states that do not appear promising. With sample-based planning and a novel parallel architecture, TEXPLORE can select actions continually in real time whenever necessary. We empirically evaluate the importance of each component of TEXPLORE in isolation and then demonstrate the complete algorithm learning to control the velocity of an autonomous vehicle in real time.
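The model-based loop the abstract describes (learn a model of the dynamics, plan on that model, act greedily, repeat) can be sketched in miniature. The sketch below is an illustrative assumption, not TEXPLORE itself: a tabular model with optimistic defaults stands in for the random-forest model, value iteration stands in for sample-based UCT planning, and all names (`env_step`, `predict`, `plan`) and the toy chain MDP are made up for the example.

```python
# Minimal model-based RL loop on a 5-state chain MDP (illustrative only).
N_STATES, ACTIONS, GOAL, GAMMA = 5, (-1, +1), 4, 0.95

def env_step(s, a):
    """True (hidden) dynamics: move along the chain, reward 1 at the goal."""
    s2 = max(0, min(N_STATES - 1, s + a))
    return s2, float(s2 == GOAL)

model = {}  # learned dynamics: (state, action) -> (next state, reward)

def predict(s, a):
    # Unvisited pairs default to a rewarding self-loop, which drives
    # exploration toward unseen parts of the state-action space.
    return model.get((s, a), (s, 1.0))

def plan():
    # Value iteration on the *learned* model, not the environment.
    V = [0.0] * N_STATES
    for _ in range(60):
        V = [max(predict(s, a)[1] + GAMMA * V[predict(s, a)[0]]
                 for a in ACTIONS) for s in range(N_STATES)]
    return V

def greedy_action(s, V):
    return max(ACTIONS,
               key=lambda a: predict(s, a)[1] + GAMMA * V[predict(s, a)[0]])

s = 0
for _ in range(40):          # interleave acting, model updates, and replanning
    a = greedy_action(s, plan())
    s2, r = env_step(s, a)
    model[(s, a)] = (s2, r)  # deterministic MDP: one sample fixes the entry
    s = s2

V = plan()
print(round(V[GOAL], 2))     # learned value of the goal state
```

In the actual algorithm, the random forest lets one observed transition inform predictions for many similar states, and planning runs in parallel with acting so that an action can be returned at a fixed real-time rate rather than after planning converges.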
Pages: 385-429 (45 pages)