A simultaneous perturbation stochastic approximation-based actor-critic algorithm for Markov decision processes

Cited by: 21
Authors
Bhatnagar, S. [1]
Kumar, S. [1]
Affiliation
[1] Indian Inst Sci, Dept Comp Sci & Automat, Bangalore 560012, Karnataka, India
Keywords
actor-critic algorithms; Markov decision processes; simultaneous perturbation stochastic approximation (SPSA); two timescale stochastic approximation;
DOI
10.1109/TAC.2004.825622
CLC number
TP [Automation & Computer Technology];
Discipline code
0812;
Abstract
A two-timescale simulation-based actor-critic algorithm is proposed for solving infinite-horizon Markov decision processes with finite state and compact action spaces under the discounted-cost criterion. The algorithm performs gradient search on the slower timescale in the space of deterministic policies, using simultaneous perturbation stochastic approximation (SPSA)-based gradient estimates. On the faster timescale, the value function corresponding to a given stationary policy is updated and averaged over a fixed number of epochs (for enhanced performance). A proof of convergence to a locally optimal policy is presented. Finally, numerical experiments applying the proposed algorithm to flow control in a bottleneck link, modeled as a continuous-time queueing system, are shown.
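As a rough illustration of the SPSA idea the abstract refers to (not a reproduction of the paper's exact two-timescale actor-critic), the sketch below estimates a gradient from just two cost evaluations by perturbing all coordinates simultaneously with a random Bernoulli vector. The quadratic `cost`, the perturbation size `delta`, and the 1/n stepsize are illustrative assumptions, with the cost function standing in for a simulated discounted cost:

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_gradient(cost, theta, delta=0.1):
    """One SPSA gradient estimate: perturb every coordinate at once with a
    random +/-1 (Bernoulli) vector d, so only two cost evaluations are
    needed regardless of the dimension of theta."""
    d = rng.choice([-1.0, 1.0], size=theta.shape)   # simultaneous perturbation
    g = (cost(theta + delta * d) - cost(theta - delta * d)) / (2.0 * delta)
    return g / d   # componentwise division: i-th estimate is g / d_i

# Toy quadratic cost (minimum at theta = [3, 3, 3, 3]), standing in for
# the long-run discounted cost that the paper estimates by simulation.
cost = lambda th: np.sum((th - 3.0) ** 2)

theta = np.zeros(4)
for n in range(1, 2001):
    a_n = 1.0 / n                                   # slower (actor) stepsize
    theta -= a_n * spsa_gradient(cost, theta)

print(theta)  # close to [3, 3, 3, 3]
```

In the paper's setting, the actor updates policy parameters this way on the slow timescale, while a critic recursion with a larger stepsize tracks the value function of the current policy on the fast timescale.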
Pages: 592-598
Page count: 7
Cited references
16 in total
[1] Barto A. G., Sutton R. S., Anderson C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983, 13(5): 834-846.
[2] Bertsekas D. Neuro-Dynamic Programming, 1st ed., 1996.
[3] Brandière O. Some pathological traps for stochastic approximation. SIAM Journal on Control and Optimization, 1998, 36(4): 1293-1314.
[4] Gelfand S. B., Mitter S. K. Recursive stochastic algorithms for global optimization in R^d. SIAM Journal on Control and Optimization, 1991, 29(5): 999-1018.
[5] Fang H. T., Chen H. F., Cao X. R. Recursive approaches for single sample path based Markov reward processes. Asian Journal of Control, 2001, 3(1): 21-26.
[6] Hirsch M. W. Convergent activation dynamics in continuous time networks. Neural Networks, 1989, 2(5): 331-349.
[7] Konda V. R., Borkar V. S. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 1999, 38(1): 94-123.
[8] Konda V. R., Tsitsiklis J. N. On actor-critic algorithms. SIAM Journal on Control and Optimization, 2003, 42(4): 1143-1166.
[9] Kushner H. J. Stochastic Approximation Methods for Constrained and Unconstrained Systems, 1978.
[10] Marbach P., Tsitsiklis J. N. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 2001, 46(2): 191-209.