A tutorial on linear function approximators for dynamic programming and reinforcement learning

Cited by: 68
Authors
Geramifard, Alborz [1 ]
Walsh, Thomas J. [1 ]
Tellex, Stefanie [2 ]
Chowdhary, Girish [1 ]
Roy, Nicholas [2 ]
How, Jonathan P. [1 ]
Affiliations
[1] MIT LIDS, United States
[2] MIT CSAIL, United States
Source
Foundations and Trends in Machine Learning | 2013, Vol. 6, No. 4
DOI: 10.1561/2200000042
Abstract
A Markov Decision Process (MDP) is a natural framework for formulating sequential decision-making problems under uncertainty. In recent years, researchers have greatly advanced algorithms for learning and acting in MDPs. This article reviews such algorithms, beginning with well-known dynamic programming methods for solving MDPs such as policy iteration and value iteration, then describing approximate dynamic programming methods such as trajectory-based value iteration, and finally moving to reinforcement learning methods such as Q-Learning, SARSA, and least-squares policy iteration. We describe algorithms in a unified framework, giving pseudocode together with memory and iteration complexity analysis for each. Empirical evaluations of these techniques with four representations across four domains provide insight into how these algorithms perform with various feature sets in terms of running time and performance. © 2013 A. Geramifard, T. J. Walsh, S. Tellex, G. Chowdhary.
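The abstract surveys dynamic programming and reinforcement learning methods built on linear function approximation, Q(s, a) ≈ θ_a · φ(s). As a rough illustration of the kind of method the tutorial covers, the sketch below shows Q-learning with a linear approximator on a toy chain MDP using one-hot state features. It is a minimal sketch only: the domain, hyperparameters, and all names are illustrative assumptions and are not taken from the article or its benchmark domains.

```python
# Minimal sketch: Q-learning with a linear function approximator,
# Q(s, a) ~= theta[a] . phi(s), on a hypothetical 5-state chain MDP.
import numpy as np

N_STATES, N_ACTIONS = 5, 2          # chain of 5 states; actions: 0 = left, 1 = right
GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1

def phi(s):
    """One-hot feature vector for state s (the simplest linear representation)."""
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def step(s, a):
    """Toy dynamics: move along the chain; reaching the right end gives reward 1 and terminates."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    done = s_next == N_STATES - 1
    return s_next, reward, done

theta = np.zeros((N_ACTIONS, N_STATES))   # one weight vector per action
rng = np.random.default_rng(0)

for episode in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection from the linear Q estimates.
        q = theta @ phi(s)
        a = int(rng.integers(N_ACTIONS)) if rng.random() < EPSILON else int(np.argmax(q))
        s_next, r, done = step(s, a)
        # Q-learning target uses the greedy value of the next state.
        target = r if done else r + GAMMA * np.max(theta @ phi(s_next))
        td_error = target - theta[a] @ phi(s)
        theta[a] += ALPHA * td_error * phi(s)   # gradient step on the linear weights
        s = s_next

print(np.round(theta, 2))   # learned weights; the greedy policy should move right
```

With one-hot features this reduces to tabular Q-learning; richer feature maps φ(s) (e.g., tile coding or radial basis functions, as discussed in the tutorial) plug into the same update without changing the code structure.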
Pages: 375-454
Page count: 79