On Generalized Bellman Equations and Temporal-Difference Learning

Cited by: 3

Authors
Yu, Huizhen [1]
Mahmood, Ashique Rupam [1]
Sutton, Richard S. [1]
Affiliation
[1] Univ Alberta, RLAI Lab, Dept Comp Sci, Edmonton, AB, Canada
Source
ADVANCES IN ARTIFICIAL INTELLIGENCE, CANADIAN AI 2017 | 2017 / Vol. 10233
Keywords
Markov decision process; Policy evaluation; Generalized Bellman equation; Temporal differences; Markov chain; Randomized stopping time
DOI
10.1007/978-3-319-57351-9_1
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy. To curb the high variance issue in off-policy TD learning, we propose a new scheme of setting the λ-parameters of TD, based on generalized Bellman equations. Our scheme is to set λ according to the eligibility trace iterates calculated in TD, thereby easily keeping these traces in a desired bounded range. Compared to prior works, this scheme is more direct and flexible, and allows much larger λ values for off-policy TD learning with bounded traces. Using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both λ and the unique invariant probability measure of the state-trace process. These results not only lead immediately to a characterization of the convergence behavior of least-squares based implementations of our scheme, but also prepare the ground for further analysis of gradient-based implementations.
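The trace-bounding idea described in the abstract can be sketched as follows, assuming linear function approximation and a simple norm-based shrinking rule for λ_t. The function name, the specific rule for choosing λ_t, and all parameter values are illustrative assumptions for this sketch, not the paper's exact algorithm, which allows general history-dependent λ functions of the trace.

```python
import numpy as np

def offpolicy_td_lambda_bounded(features, rewards, rho,
                                gamma=0.99, trace_bound=10.0, alpha=0.01):
    """Sketch of off-policy linear TD(lambda) in which lambda_t is chosen from
    the current eligibility-trace iterate so that the trace stays in a bounded
    range. The norm-shrinking rule below is a hypothetical stand-in for the
    paper's general trace-dependent lambda functions.

    features: array of shape (T+1, n); rewards: length T; rho: length T
    importance-sampling ratios (target policy over behavior policy)."""
    theta = np.zeros(features.shape[1])  # value-function weights
    e = np.zeros(features.shape[1])      # eligibility trace
    prev_rho = 1.0
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        # Trace carried over from the previous step, decayed by gamma and
        # weighted by the previous transition's importance-sampling ratio.
        carried = gamma * prev_rho * e
        norm = np.linalg.norm(carried)
        # Choose lambda_t <= 1 just small enough to keep the trace bounded.
        lam = 1.0 if norm <= trace_bound else trace_bound / norm
        e = lam * carried + phi
        # Off-policy TD error, corrected by the current ratio rho[t].
        delta = rewards[t] + gamma * phi_next @ theta - phi @ theta
        theta = theta + alpha * rho[t] * delta * e
        prev_rho = rho[t]
    return theta
```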
Pages: 3 - 14
Number of pages: 12