Characterizing reinforcement learning methods through parameterized learning problems

Cited by: 11
Authors
Kalyanakrishnan, Shivaram [1 ]
Stone, Peter [1 ]
Affiliations
[1] Univ Texas Austin, Dept Comp Sci, Austin, TX 78701 USA
Funding
U.S. National Science Foundation
Keywords
Reinforcement learning; Empirical evaluation; Partial observability; Function approximation; Parameterized learning problem; EVOLUTIONARY ALGORITHMS; SELECTION; ISSUES
DOI
10.1007/s10994-011-5251-x
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The field of reinforcement learning (RL) has been energized in the past few decades by elegant theoretical results indicating under what conditions, and how quickly, certain algorithms are guaranteed to converge to optimal policies. However, in practical problems, these conditions are seldom met. When we cannot achieve optimality, the performance of RL algorithms must be measured empirically. Consequently, in order to meaningfully differentiate learning methods, it becomes necessary to characterize their performance on different problems, taking into account factors such as state estimation, exploration, function approximation, and constraints on computation and memory. To this end, we propose parameterized learning problems, in which such factors can be controlled systematically and their effects on learning methods characterized through targeted studies. Apart from providing very precise control of the parameters that affect learning, our parameterized learning problems enable benchmarking against optimal behavior; their relatively small sizes facilitate extensive experimentation. Based on a survey of existing RL applications, in this article we focus our attention on two predominant, "first-order" factors: partial observability and function approximation. We design an appropriate parameterized learning problem, through which we compare two qualitatively distinct classes of algorithms: on-line value function-based methods and policy search methods. Empirical comparisons among various methods within each of these classes project Sarsa(lambda) and Q-learning(lambda) as winners among the former, and CMA-ES as the winner in the latter. Comparing Sarsa(lambda) and CMA-ES further on relevant problem instances, our study highlights regions of the problem space favoring their contrasting approaches. Short run-times for our experiments allow for an extensive search procedure that provides additional insights on relationships between method-specific parameters (such as eligibility traces, initial weights, and population sizes) and problem instances.
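For readers unfamiliar with the on-line value-function methods named in the abstract, the following is a minimal sketch of Sarsa(lambda) with linear function approximation and accumulating eligibility traces, the class of algorithm represented by the paper's Sarsa(lambda) winner. The environment interface (reset/step), the feature map featurize, and all hyperparameter defaults are illustrative assumptions, not the parameterized learning problem or settings used by the authors.

```python
# Hedged sketch: Sarsa(lambda) with linear function approximation and
# accumulating eligibility traces. The environment interface and feature
# map are placeholders, not the authors' parameterized learning problem.
import numpy as np

def sarsa_lambda(env, featurize, n_features, n_actions,
                 episodes=500, alpha=0.05, gamma=1.0, lam=0.8, epsilon=0.1,
                 seed=0):
    """Return a weight matrix w such that Q(s, a) ~= w[a] @ featurize(s)."""
    rng = np.random.default_rng(seed)
    w = np.zeros((n_actions, n_features))      # initial weights (a tunable factor)

    def q(phi):                                 # action values for one state
        return w @ phi

    def epsilon_greedy(phi):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(q(phi)))

    for _ in range(episodes):
        e = np.zeros_like(w)                    # eligibility traces
        s = env.reset()                         # assumed interface: reset() -> state
        phi = featurize(s)
        a = epsilon_greedy(phi)
        done = False
        while not done:
            # assumed interface: step(a) -> (next_state, reward, done)
            s_next, r, done = env.step(a)
            delta = r - q(phi)[a]               # TD error (terminal case)
            e[a] += phi                         # accumulating trace for chosen action
            if not done:
                phi_next = featurize(s_next)
                a_next = epsilon_greedy(phi_next)
                delta += gamma * q(phi_next)[a_next]
            w += alpha * delta * e              # on-line weight update
            e *= gamma * lam                    # decay all traces
            if not done:
                phi, a = phi_next, a_next
    return w
```

In the paper's study, the eligibility-trace parameter lam and the initial weights are among the method-specific parameters whose interaction with problem instances is searched over; the values above are arbitrary placeholders.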
Pages: 205-247
Number of pages: 43