Model-based average reward reinforcement learning

Cited by: 40
Authors
Tadepalli, P [1 ]
Ok, D
Institutions
[1] Oregon State Univ, Dept Comp Sci, Corvallis, OR 97331 USA
[2] Korean Army Comp Ctr, Chungnam 320919, South Korea
Keywords
machine learning; reinforcement learning; average reward; model-based; exploration; Bayesian networks; linear regression; AGV scheduling
DOI
10.1016/S0004-3702(98)00002-2
CLC classification number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Reinforcement Learning (RL) is the study of programs that improve their performance by receiving rewards and punishments from the environment. Most RL methods optimize the discounted total reward received by an agent, while, in many domains, the natural criterion is to optimize the average reward per time step. In this paper, we introduce a model-based Average-reward Reinforcement Learning method called H-learning and show that it converges more quickly and robustly than its discounted counterpart in the domain of scheduling a simulated Automatic Guided Vehicle (AGV). We also introduce a version of H-learning that automatically explores the unexplored parts of the state space, while always choosing greedy actions with respect to the current value function. We show that this "Auto-exploratory H-learning" performs better than the previously studied exploration strategies. To scale H-learning to larger state spaces, we extend it to learn action models and reward functions in the form of dynamic Bayesian networks, and approximate its value function using local linear regression. We show that both of these extensions are effective in significantly reducing the space requirement of H-learning and making it converge faster in some AGV scheduling tasks. (C) 1998 Published by Elsevier Science B.V.
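The abstract describes a model-based, average-reward learner that acts greedily with respect to a learned model and a relative value function. As a rough illustration of that class of methods only (not the paper's exact H-learning update), the Python sketch below keeps empirical transition and reward estimates, performs an average-reward Bellman backup on visited states, and tracks an average-reward estimate rho. The class name, parameter names, and the optimistic handling of untried actions are all hypothetical choices made for this sketch.

```python
from collections import defaultdict

class TabularHLearner:
    """Illustrative model-based average-reward learner in the spirit of
    H-learning; a sketch, NOT the authors' exact algorithm."""

    def __init__(self, n_states, actions, alpha=0.1):
        self.actions = list(actions)
        self.h = [0.0] * n_states          # relative state values h(s)
        self.rho = 0.0                     # estimate of average reward per step
        self.alpha = alpha                 # step size for updating rho
        # empirical model learned from experience
        self.n_sa = defaultdict(int)       # visit counts N(s, a)
        self.n_sas = defaultdict(int)      # transition counts N(s, a, s')
        self.r_sum = defaultdict(float)    # summed immediate rewards for (s, a)

    def _lookahead(self, s, a):
        """One-step lookahead: r(s,a) - rho + sum_s' P(s'|s,a) h(s')."""
        n = self.n_sa[(s, a)]
        if n == 0:
            return float("inf")            # optimistic: untried actions look best (a sketch-only heuristic)
        r_hat = self.r_sum[(s, a)] / n
        exp_h = sum((self.n_sas[(s, a, s2)] / n) * self.h[s2]
                    for s2 in range(len(self.h)))
        return r_hat - self.rho + exp_h

    def act(self, s):
        """Greedy action with respect to the current model and value function."""
        return max(self.actions, key=lambda a: self._lookahead(s, a))

    def update(self, s, a, r, s2):
        # 1) update the empirical transition/reward model
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s2)] += 1
        self.r_sum[(s, a)] += r
        # 2) average-reward Bellman backup for the visited state
        self.h[s] = max(self._lookahead(s, b)
                        for b in self.actions if self.n_sa[(s, b)] > 0)
        # 3) nudge rho toward the observed reward adjusted by the change in h
        self.rho += self.alpha * (r + self.h[s2] - self.h[s] - self.rho)
```

In use, an agent loop would call act on the current state, execute that action in the environment (e.g., an AGV scheduling simulator), and pass the observed transition and reward back to update.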
Pages: 177-224
Page count: 48