Efficient Reinforcement Learning via Probabilistic Trajectory Optimization

Cited by: 20
Authors
Pan, Yunpeng [1 ]
Boutselis, George, I [2 ]
Theodorou, Evangelos A. [2 ]
Affiliations
[1] JD.com American Technologies Corporation, JD X Silicon Valley Research Center, Santa Clara, CA 95054, USA
[2] Georgia Institute of Technology, Department of Aerospace Engineering, Atlanta, GA 30332, USA
Funding
U.S. National Science Foundation (NSF)
Keywords
Dynamic programming; Gaussian processes (GPs); optimal control; reinforcement learning (RL); trajectory optimization; optimal feedback control; uncertainty
DOI
10.1109/TNNLS.2017.2764499
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
We present a trajectory optimization approach to reinforcement learning in continuous state and action spaces, called probabilistic differential dynamic programming (PDDP). Our method represents the system dynamics using Gaussian processes (GPs) and performs local dynamic programming iteratively around a nominal trajectory in Gaussian belief spaces. Unlike model-based policy search methods, PDDP does not require a policy parameterization and learns a time-varying control policy via successive forward-backward sweeps. A convergence analysis of the iterative scheme is given, showing that our algorithm converges globally to a stationary point under certain conditions. We show that prior model knowledge can be incorporated into the proposed framework to speed up learning, and that a generalized optimization criterion based on the predicted cost distribution can be employed to enable risk-sensitive learning. We demonstrate the effectiveness and efficiency of the proposed algorithm on nontrivial tasks. Compared with a state-of-the-art GP-based policy search method, PDDP offers a superior combination of learning speed, data efficiency, and applicability.
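The abstract describes two ingredients: a GP model of the unknown dynamics and iterative forward-backward (DDP/iLQR-style) sweeps around a nominal trajectory that yield a time-varying affine policy. The following is a minimal sketch of that workflow for a scalar system, not the authors' implementation: it fits a GP to sampled transitions and runs the sweeps on the GP predictive mean only, whereas full PDDP propagates Gaussian beliefs (mean and covariance). All names here (e.g. gp_mean, finite_diff, the toy dynamics) are illustrative assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# 1. Collect transitions from the true (unknown) scalar system.
def true_dynamics(x, u):
    return x + 0.1 * (-np.sin(x) + u)          # hidden from the learner

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))          # columns: state x, control u
Y = np.array([true_dynamics(x, u) for x, u in X]) + 0.01 * rng.standard_normal(200)

# 2. Fit a GP dynamics model x_{t+1} ~ GP(x_t, u_t).
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True).fit(X, Y)

def gp_mean(x, u):
    return float(gp.predict(np.array([[x, u]]))[0])

def finite_diff(f, x, u, eps=1e-4):
    fx = (f(x + eps, u) - f(x - eps, u)) / (2 * eps)
    fu = (f(x, u + eps) - f(x, u - eps)) / (2 * eps)
    return fx, fu

# 3. Iterative forward-backward sweeps (iLQR-style) on the GP mean.
#    Cost: sum_t (x_t - x_goal)^2 + 0.1 u_t^2, plus terminal (x_T - x_goal)^2.
T, x0, x_goal = 30, 1.5, 0.0
u_traj = np.zeros(T)
for sweep in range(20):
    # Forward pass: roll out the nominal trajectory under current controls.
    x_traj = np.empty(T + 1); x_traj[0] = x0
    for t in range(T):
        x_traj[t + 1] = gp_mean(x_traj[t], u_traj[t])
    # Backward pass: quadratic value-function recursion along the trajectory.
    Vx, Vxx = 2 * (x_traj[-1] - x_goal), 2.0
    k, K = np.zeros(T), np.zeros(T)
    for t in reversed(range(T)):
        fx, fu = finite_diff(gp_mean, x_traj[t], u_traj[t])
        Qx  = 2 * (x_traj[t] - x_goal) + fx * Vx
        Qu  = 0.2 * u_traj[t] + fu * Vx
        Qxx = 2.0 + fx * Vxx * fx
        Quu = 0.2 + fu * Vxx * fu
        Qux = fu * Vxx * fx
        k[t], K[t] = -Qu / Quu, -Qux / Quu     # feedforward / feedback gains
        Vx  = Qx - Qux * Qu / Quu
        Vxx = Qxx - Qux * Qux / Quu
    # Forward pass applying the new time-varying affine policy.
    x = x0
    for t in range(T):
        u_traj[t] = u_traj[t] + 0.5 * k[t] + K[t] * (x - x_traj[t])
        x = gp_mean(x, u_traj[t])

print("final state after optimization:", x)

Running the sketch drives the state toward x_goal using only the learned model; the time-varying gains k[t], K[t] play the role of the successive forward-backward sweeps mentioned in the abstract, while belief-space propagation and risk-sensitive criteria are omitted for brevity.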
Pages: 5459-5474
Page count: 16