Policy Search for the Optimal Control of Markov Decision Processes: A Novel Particle-Based Iterative Scheme

Cited by: 14
Authors
Manganini, Giorgio [1 ]
Pirotta, Matteo [1 ]
Restelli, Marcello [1 ]
Piroddi, Luigi [1 ]
Prandini, Maria [1 ]
Affiliations
[1] Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria, I-20133 Milan, Italy
Funding
EU Horizon 2020;
Keywords
Approximate dynamic programming (ADP); Markov decision processes (MDPs); policy search; reinforcement learning (RL); stochastic optimal control; hybrid systems; reinforcement; reachability; algorithm
DOI
10.1109/TCYB.2015.2483780
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Classical approximate dynamic programming techniques based on state-space gridding become computationally impractical for high-dimensional problems. Policy search techniques cope with this curse of dimensionality by searching for the optimal control policy in a restricted, parameterized policy space. Here we focus on the case of a discrete action space and introduce a novel policy parametrization that adopts particles to describe the map from the state space to the action space, each particle representing a region of the state space that is mapped into a certain action. The locations and actions associated with the particles describing a policy can be tuned by means of a recently introduced policy gradient method with parameter-based exploration. The task of selecting an appropriately sized set of particles is solved through an iterative policy-building scheme that adds new particles to improve the policy performance and is also capable of removing redundant particles. Experiments demonstrate the scalability of the proposed approach as the dimensionality of the state space grows.
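
To make the parametrization concrete, the minimal Python sketch below illustrates one plausible reading of the abstract: a nearest-particle rule for the state-to-action map, and a PGPE-style (parameter-based exploration) update of the particle locations. All names here (ParticlePolicy, pgpe_update, evaluate) are hypothetical illustrations; the paper's exact region definition and gradient estimator may differ.

import numpy as np

class ParticlePolicy:
    """Each particle pairs a location in the state space with a discrete
    action; a state is mapped to the action of its nearest particle
    (an assumed region rule, not necessarily the paper's exact one)."""

    def __init__(self, locations, actions):
        self.locations = np.asarray(locations, dtype=float)  # shape (K, state_dim)
        self.actions = np.asarray(actions)                   # shape (K,)

    def act(self, state):
        # Nearest-particle lookup: the closest particle's action is taken.
        dists = np.linalg.norm(self.locations - np.asarray(state, dtype=float), axis=1)
        return self.actions[np.argmin(dists)]

def pgpe_update(mu, sigma, evaluate, n_samples=20, lr=0.1, seed=0):
    """One PGPE-style step on the (flattened) particle locations: sample
    perturbed parameter vectors from a Gaussian centered at mu, score each
    with `evaluate` (an episodic-return estimate supplied by the caller),
    and move mu along the baseline-corrected likelihood-ratio gradient."""
    rng = np.random.default_rng(seed)
    thetas = mu + sigma * rng.standard_normal((n_samples, mu.size))
    returns = np.array([evaluate(theta) for theta in thetas])
    advantages = returns - returns.mean()  # simple mean baseline
    grad = (advantages[:, None] * (thetas - mu)).mean(axis=0) / sigma**2
    return mu + lr * grad

# Usage: a 2-D state space, three particles, binary actions.
policy = ParticlePolicy(locations=[[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]],
                        actions=[0, 1, 0])
print(policy.act([0.9, 0.8]))  # -> 1 (nearest particle sits at [1, 1])

The iterative scheme described in the abstract would, on top of this, add a particle where performance is poor and prune particles whose removal leaves the induced state-to-action map unchanged; that loop is omitted here.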
Pages: 2643-2655
Number of pages: 13