Q-learning and policy iteration algorithms for stochastic shortest path problems

Cited by: 0
Authors
Huizhen Yu
Dimitri P. Bertsekas
Affiliations
[1] MIT, Lab. for Information and Decision Systems
[2] MIT, Dept. of Electrical Engineering and Computer Science
Source
Annals of Operations Research | 2013 / Vol. 208
Keywords
Markov decision processes; Q-learning; Approximate dynamic programming; Value iteration; Policy iteration; Stochastic shortest paths; Stochastic approximation;
DOI
Not available
Abstract
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup-table implementations of our method for undiscounted, total-cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and they provide policy iteration-like alternatives to Q-learning whose convergence is as reliable as that of classical Q-learning. We also discuss methods that use basis function approximations of Q-factors, and we give an associated error bound.
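The abstract describes the policy evaluation phase as a finite number of value iterations applied to an optimal stopping problem, followed by a standard policy improvement step. The Python sketch below is only an illustration of that idea in its simplest synchronous, model-based, lookup-table form; it is not the paper's actual asynchronous or stochastic algorithm. It assumes a known model with transition probabilities p[i, u, j] and expected one-stage costs g[i, u], and the names pi_like_q_learning_ssp, num_cycles, and K are illustrative choices, not taken from the paper.

import numpy as np


def pi_like_q_learning_ssp(p, g, num_cycles=50, K=5):
    """Synchronous sketch of a policy-iteration-like Q-learning scheme for SSP.

    p : array of shape (n, m, n); p[i, u, j] is the probability of moving from
        state i to state j under control u.  Probability mass not accounted for
        (1 - sum_j p[i, u, j]) is treated as reaching the cost-free termination
        state.
    g : array of shape (n, m); g[i, u] is the expected one-stage cost of
        applying control u at state i.
    """
    n, m, _ = p.shape
    Q = np.zeros((n, m))           # Q-factors
    J = np.zeros(n)                # "stopping" costs, frozen between improvements
    mu = np.zeros(n, dtype=int)    # current policy

    for _ in range(num_cycles):
        # Policy evaluation phase: K value iterations of the optimal stopping
        # mapping  (F Q)(i, u) = g(i, u) + sum_j p_ij(u) * min(J(j), Q(j, mu(j))),
        # instead of solving the linear system for the policy mu exactly.
        for _ in range(K):
            continue_cost = Q[np.arange(n), mu]   # cost of continuing with mu
            w = np.minimum(J, continue_cost)      # stop (take J) or continue
            Q = g + p @ w                         # backup for every state-control pair
        # Policy improvement phase: the only step minimizing over all controls.
        mu = Q.argmin(axis=1)
        J = Q.min(axis=1)

    return Q, mu, J


# Illustrative 2-state, 2-control example with made-up numbers.
if __name__ == "__main__":
    p = np.array([[[0.0, 0.8],    # state 0: u=0 -> state 1 w.p. 0.8, else terminate
                   [0.0, 0.0]],   #          u=1 -> terminate immediately
                  [[0.5, 0.4],    # state 1: u=0 -> state 0 / state 1 / terminate
                   [0.0, 0.0]]])  #          u=1 -> terminate immediately
    g = np.array([[1.0, 4.0],
                  [2.0, 3.0]])
    Q, mu, J = pi_like_q_learning_ssp(p, g)
    print("policy:", mu, "costs:", J)

With K = 1 each cycle reduces to the classical Q-learning/value iteration backup Q <- g + P min_u Q, while larger K performs more evaluation work per policy improvement, which is the modified policy iteration flavor mentioned in the abstract; the paper's asynchronous deterministic and stochastic lookup-table implementations refine this basic scheme.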
Pages: 95-132
Number of pages: 37