A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients

Cited by: 685
Authors
Grondman, Ivo [1]
Busoniu, Lucian [2,3,4]
Lopes, Gabriel A. D. [1]
Babuska, Robert [1]
Affiliations
[1] Delft Univ Technol, Delft Ctr Syst & Control, NL-2628 CD Delft, Netherlands
[2] Univ Lorraine, CRAN, UMR 7039, F-54500 Vandoeuvre Les Nancy, France
[3] CNRS, CRAN, UMR 7039, F-54500 Vandoeuvre Les Nancy, France
[4] Tech Univ Cluj Napoca, Dept Automat, Cluj Napoca 400020, Romania
Source
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND REVIEWS | 2012, Vol. 42, No. 6
Keywords
Actor-critic; natural gradient; policy gradient; reinforcement learning (RL); ALGORITHM; COST; APPROXIMATION;
DOI
10.1109/TSMCC.2012.2218595
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Policy-gradient-based actor-critic algorithms are amongst the most popular algorithms in the reinforcement learning framework. Their advantage of being able to search for optimal policies using low-variance gradient estimates has made them useful in several real-life applications, such as robotics, power control, and finance. Although general surveys on reinforcement learning techniques already exist, none is specifically dedicated to actor-critic algorithms. This paper, therefore, describes the state of the art of actor-critic algorithms, with a focus on methods that can work in an online setting and use function approximation in order to deal with continuous state and action spaces. After starting with a discussion on the concepts of reinforcement learning and the origins of actor-critic algorithms, this paper describes the workings of the natural gradient, which has made its way into many actor-critic algorithms over the past few years. A review of several standard and natural actor-critic algorithms is given, and the paper concludes with an overview of application areas and a discussion on open issues.
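To illustrate the kind of method the abstract describes, the following is a minimal sketch of an online actor-critic with linear function approximation, not any specific algorithm reviewed in the paper. It pairs a TD(0) critic with a Gaussian policy whose mean is updated along a standard policy gradient, using the TD error as a low-variance advantage estimate. The toy environment, feature map, and step sizes are illustrative assumptions.

```python
# Minimal illustrative actor-critic sketch (hypothetical, not the paper's method):
# linear critic V(s) ~ w . phi(s) learned by TD(0), Gaussian policy with mean
# mu(s) = theta . phi(s), actor updated along grad log pi weighted by the TD error.
import numpy as np

rng = np.random.default_rng(0)

def features(s, n=8):
    """Radial-basis features over a scalar state (assumed for illustration)."""
    centers = np.linspace(-1.0, 1.0, n)
    return np.exp(-((s - centers) ** 2) / 0.1)

n = 8
w = np.zeros(n)        # critic weights
theta = np.zeros(n)    # actor weights (policy mean)
sigma = 0.3            # fixed exploration std of the Gaussian policy
alpha_w, alpha_theta, gamma = 0.1, 0.01, 0.95

def env_step(s, a):
    """Toy dynamics: reward for driving the state toward zero (stand-in environment)."""
    s_next = np.clip(s + 0.1 * a + 0.01 * rng.standard_normal(), -1.0, 1.0)
    reward = -s_next ** 2 - 0.01 * a ** 2
    return s_next, reward

s = rng.uniform(-1.0, 1.0)
for t in range(5000):
    phi = features(s)
    mu = theta @ phi
    a = mu + sigma * rng.standard_normal()                   # sample action from pi(a|s)

    s_next, r = env_step(s, a)
    delta = r + gamma * (w @ features(s_next)) - w @ phi     # TD error

    w += alpha_w * delta * phi                               # critic: TD(0) update
    # actor: policy-gradient step, grad log pi = (a - mu) / sigma^2 * phi,
    # with the TD error serving as the advantage estimate
    theta += alpha_theta * delta * (a - mu) / sigma ** 2 * phi

    s = s_next
```

A natural actor-critic variant, as discussed in the survey, would additionally precondition the actor step with the inverse Fisher information matrix of the policy, so that the update follows the natural rather than the vanilla gradient.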
Pages: 1291-1307
Number of pages: 17