A Markov chain Monte Carlo algorithm for Bayesian policy search

Cited by: 6
Authors
Aghaei, Vahid Tavakol [1]
Onat, Ahmet [1]
Yildirim, Sinan [2]
Affiliations
[1] Sabanci University, Mechatronics Engineering, Faculty of Engineering and Natural Sciences, Istanbul, Turkey
[2] Sabanci University, Industrial Engineering, Faculty of Engineering and Natural Sciences, Istanbul, Turkey
Keywords
Reinforcement learning; Markov chain Monte Carlo; particle filtering; risk-sensitive reward; policy search; control
DOI
10.1080/21642583.2018.1528483
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Discipline classification code
0812
Abstract
Policy search algorithms have facilitated the application of Reinforcement Learning (RL) to dynamic systems, such as the control of robots. Many policy search algorithms are based on the policy gradient and may therefore suffer from slow convergence or convergence to local optima. In this paper, we take a Bayesian approach to policy search under the RL paradigm, for the problem of controlling a discrete-time Markov decision process with continuous state and action spaces and a multiplicative reward structure. For this purpose, we place a prior over the policy parameters and target the 'posterior' distribution in which the 'likelihood' is the expected reward. We propose a Markov chain Monte Carlo algorithm as a method of generating samples of the policy parameters from this posterior. When applied to a nonlinear model of the Cart-Pole benchmark, the proposed algorithm outperforms certain well-known policy gradient-based RL methods in terms of time response and convergence rate.
Pages: 438-455
Page count: 18
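
The abstract describes the method only at a high level. As a rough illustration of the reward-as-likelihood idea it names, the sketch below runs a pseudo-marginal-style random-walk Metropolis sampler whose target is prior(theta) times a Monte Carlo estimate of the expected (multiplicative) reward. The one-dimensional dynamics, the linear policy a = -theta * x, the Gaussian prior, and all tuning constants are assumptions standing in for the paper's Cart-Pole setup; this is not the authors' implementation.

```python
import numpy as np

# Hedged sketch: MCMC policy search with expected reward as the
# "likelihood" (pseudo-marginal-style random-walk Metropolis).
# Toy scalar system and linear policy are illustrative assumptions.

rng = np.random.default_rng(0)

def rollout_reward(theta, horizon=50):
    """One rollout of a hypothetical scalar system x' = x + 0.1*a + noise
    under the policy a = -theta * x; returns a multiplicative reward
    that stays in (0, 1]."""
    x, reward = 1.0, 1.0
    for _ in range(horizon):
        a = -theta * x
        x = x + 0.1 * a + 0.01 * rng.normal()
        reward *= np.exp(-x**2)  # per-step reward factor in (0, 1]
    return reward

def estimated_likelihood(theta, n_rollouts=20):
    """Monte Carlo estimate of E[reward | theta]."""
    return np.mean([rollout_reward(theta) for _ in range(n_rollouts)])

def log_prior(theta):
    """Standard Gaussian prior on the policy parameter (an assumption)."""
    return -0.5 * theta**2

# Random-walk Metropolis over the policy parameter. The stored noisy
# likelihood estimate is reused for the current state, in the spirit of
# pseudo-marginal MCMC.
theta = 0.0
log_like = np.log(estimated_likelihood(theta) + 1e-300)
samples = []
for _ in range(2000):
    prop = theta + 0.3 * rng.normal()
    prop_log_like = np.log(estimated_likelihood(prop) + 1e-300)
    log_alpha = (prop_log_like + log_prior(prop)) - (log_like + log_prior(theta))
    if np.log(rng.uniform()) < log_alpha:
        theta, log_like = prop, prop_log_like
    samples.append(theta)

print("posterior mean of policy gain:", np.mean(samples[500:]))
```

Because the likelihood here is an average of rollout rewards rather than a density, the chain targets the reward-weighted posterior the abstract describes; increasing n_rollouts reduces the variance of the estimate at the cost of more simulation per MCMC step.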