Q-learning and policy iteration algorithms for stochastic shortest path problems

Authors
Huizhen Yu
Dimitri P. Bertsekas
Affiliations
[1] MIT, Lab. for Information and Decision Systems
[2] MIT, Dept. of Electrical Engineering and Computer Science
Source
Annals of Operations Research | 2013 / Volume 208
Keywords
Markov decision processes; Q-learning; Approximate dynamic programming; Value iteration; Policy iteration; Stochastic shortest paths; Stochastic approximation
Abstract
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and they provide policy-iteration-like alternative Q-learning schemes whose convergence is as reliable as that of classical Q-learning. We also discuss methods that use basis function approximations of Q-factors, and we give an associated error bound.
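To make the evaluation-via-optimal-stopping structure concrete, the following Python sketch illustrates the idea described in the abstract in a minimal synchronous lookup-table form; it is not the paper's algorithm. The function name pi_like_q_learning, the array shapes, and the parameters n_policy_iters and n_eval_sweeps are assumptions made for this example. Each cycle performs one policy improvement (the only step minimizing over all controls) followed by a few value-iteration sweeps of an optimal-stopping operator whose stopping cost is the greedy cost J(j) = min_u Q(j, u).

import numpy as np

def pi_like_q_learning(P, g, n_policy_iters=50, n_eval_sweeps=5):
    # Hypothetical inputs for this sketch (not from the paper):
    # P: shape (n_controls, n_states, n_states); row sums may be < 1, with the
    #    leftover probability going to the cost-free termination state.
    # g: shape (n_states, n_controls), expected one-stage costs.
    n_controls, n_states, _ = P.shape
    Q = np.zeros((n_states, n_controls))
    for _ in range(n_policy_iters):
        # Policy improvement: the only minimization over all controls.
        J = Q.min(axis=1)          # greedy costs, reused as "stopping" costs
        mu = Q.argmin(axis=1)      # greedy policy
        # Inexact policy evaluation: a few sweeps of the optimal-stopping operator
        #   (F Q)(i,u) = g(i,u) + sum_j P[u][i][j] * min( J(j), Q(j, mu(j)) ).
        for _ in range(n_eval_sweeps):
            v = np.minimum(J, Q[np.arange(n_states), mu])
            Q = g + np.stack([P[u] @ v for u in range(n_controls)], axis=1)
    return Q.argmin(axis=1), Q.min(axis=1)

# Example with made-up numbers: a 2-state SSP with 2 controls, where every
# state has positive probability of reaching the termination state.
P = np.array([[[0.5, 0.3], [0.0, 0.6]],    # control 0
              [[0.1, 0.2], [0.4, 0.1]]])   # control 1
g = np.array([[1.0, 2.0], [0.5, 1.5]])
policy, costs = pi_like_q_learning(P, g)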
Pages: 95-132
Number of pages: 37
References
28 items in total
[1] Baudet, G. M. (1978). Asynchronous iterative methods for multiprocessors. Journal of the Association for Computing Machinery, 25, 226-244.
[2] Bertsekas, D. P. (1982). Distributed dynamic programming. IEEE Transactions on Automatic Control, 27, 610-616.
[3] Bertsekas, D. P. (1983). Distributed asynchronous computation of fixed points. Mathematical Programming, 27, 107-120.
[4] Bertsekas, D. P., & Tsitsiklis, J. N. (1991). An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16, 580-595.
[5] Bertsekas, D. P., & Yu, H. (2009). Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics, 227, 27-50.
[6] Bertsekas, D. P., & Yu, H. (2012). Q-learning and enhanced policy iteration in discounted dynamic programming. Mathematics of Operations Research, 37(1), 66-94.
[7] Canbolat, P. G., & Rothblum, U. G. (2012). (Approximate) iterated successive approximations algorithm for sequential decision processes. Annals of Operations Research.
[8] Choi, D. S., & Van Roy, B. (2006). A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems, 16, 207-239.
[9] Eaton, J. H., & Zadeh, L. A. (1962). Optimal pursuit strategies in discrete state probabilistic systems. Transactions of the ASME, Series D, Journal of Basic Engineering, 84, 23-29.
[10] Stationary strategies in Borel dynamic programming (1992). Mathematics of Operations Research, 17, 392-397.