Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes

Cited: 0
Authors
Mohammed Shahid Abdulla
Shalabh Bhatnagar
Affiliations
[1] Indian Institute of Science, Department of Computer Science and Automation
Source
Discrete Event Dynamic Systems: Theory and Applications, 2007, 17(1): 23-52
Keywords
Actor-critic algorithms; Two timescale stochastic approximation; Markov decision processes; Policy iteration; Simultaneous perturbation stochastic approximation; Normalized Hadamard matrices; Reinforcement learning; TD-learning
DOI: not available
Abstract
This article proposes several two-timescale simulation-based actor-critic algorithms for the solution of infinite-horizon Markov Decision Processes with a finite state space under the average cost criterion. Two of the algorithms are for the compact (non-discrete) action setting, while the rest are for finite action spaces. On the slower timescale, all the algorithms perform a gradient search over the corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated, and an additional averaging step is performed for improved performance. A proof of convergence to a locally optimal policy is presented. Next, we discuss a memory-efficient implementation that uses a feature-based representation of the state space and performs TD(0) learning along the faster timescale. The TD(0) algorithm does not follow an on-line sampling of states, but is observed to do well in our setting. Numerical experiments on a problem of rate-based flow control are presented using the proposed algorithms. We consider the model of a single bottleneck node in a continuous-time queueing framework. We show performance comparisons of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999) and Bhatnagar and Kumar (2004). Our algorithms exhibit more than an order of magnitude better performance than those of Konda and Borkar (1999).
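To make the two-timescale structure described in the abstract concrete, the following is a minimal illustrative sketch in Python of a one-measurement SPSA actor-critic loop for an average cost MDP. It is not the authors' algorithm or their flow-control model: the toy two-state MDP in step(), the scalar per-state action parameters theta, the Bernoulli perturbations, and the step-size schedules are assumptions made purely for illustration. The critic (the differential cost estimates V and the average-cost estimate rho) is updated on the faster timescale, while the actor moves the policy parameters on the slower timescale along an SPSA estimate of the gradient of the average cost.

import numpy as np

# Illustrative sketch only: a toy two-state MDP with one scalar action per
# state (a hypothetical stand-in, not the flow-control model from the paper).
rng = np.random.default_rng(0)

N_STATES = 2

def step(s, a):
    """One transition: action a in [0, 1] biases the chance of staying in
    state s; the one-stage cost penalises state 1 and large actions."""
    stay = 0.5 + 0.4 * (a - 0.5)
    s_next = s if rng.random() < stay else 1 - s
    cost = float(s) + 0.1 * a ** 2
    return s_next, cost

theta = np.full(N_STATES, 0.5)   # actor: one action per state (compact action setting)
V = np.zeros(N_STATES)           # critic: differential cost estimates
rho = 0.0                        # critic: running average-cost estimate
delta = 0.1                      # SPSA perturbation magnitude
s = 0

for n in range(1, 20001):
    b_n = 1.0 / n ** 0.6         # faster timescale (critic) step size
    a_n = 1.0 / n                # slower timescale (actor) step size, a_n / b_n -> 0

    # SPSA: act with the policy perturbed by a random +/-1 (Bernoulli) vector.
    Delta = rng.choice([-1.0, 1.0], size=N_STATES)
    theta_pert = np.clip(theta + delta * Delta, 0.0, 1.0)

    s_next, c = step(s, theta_pert[s])

    # Faster timescale: TD(0)-style update of the differential cost under the
    # average cost criterion, together with the average-cost estimate itself.
    td_err = c - rho + V[s_next] - V[s]
    rho += b_n * (c - rho)
    V[s] += b_n * td_err

    # Slower timescale: one-measurement SPSA gradient estimate of the average
    # cost with respect to theta; move against the estimated gradient.
    theta = np.clip(theta - a_n * td_err / (delta * Delta), 0.0, 1.0)

    s = s_next

print("learned actions per state:", np.round(theta, 2))
print("estimated average cost   :", round(rho, 3))

The essential design point is the step-size separation a_n / b_n -> 0: the critic sees an effectively frozen policy and tracks its differential cost, so the actor's SPSA update behaves like a gradient step on the true average cost, which is the structure the convergence analysis in the article relies on.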
Pages: 23-52
Page count: 29
Related papers
50 records in total
  • [1] Reinforcement learning based algorithms for average cost Markov Decision Processes
    Abdulla, Mohammed Shahid
    Bhatnagar, Shalabh
    Discrete Event Dynamic Systems: Theory and Applications, 2007, 17(1): 23-52
  • [2] Learning algorithms for Markov decision processes with average cost
    Abounadi, J.
    Bertsekas, D.
    Borkar, V. S.
    SIAM Journal on Control and Optimization, 2001, 40(3): 681-698
  • [3] Adaptive aggregation for reinforcement learning in average reward Markov decision processes
    Ortner, Ronald
    Annals of Operations Research, 2013, 208(1): 321-336
  • [4] Average Reward Reinforcement Learning for Semi-Markov Decision Processes
    Yang, Jiayuan
    Li, Yanjie
    Chen, Haoyao
    Li, Jiangang
    Neural Information Processing, ICONIP 2017, Part I, 2017, Vol. 10634: 768-777
  • [5] Reinforcement Learning for Cost-Aware Markov Decision Processes
    Suttle, Wesley A.
    Zhang, Kaiqing
    Yang, Zhuoran
    Kraemer, David N.
    Liu, Ji
    International Conference on Machine Learning, 2021, Vol. 139
  • [6] Reinforcement Learning Algorithms for Regret Minimization in Structured Markov Decision Processes
    Prabuchandran, K. J.
    Bodas, Tejas
    Tulabandhula, Theja
    AAMAS'16: Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, 2016: 1289-1290
  • [7] A reinforcement learning based algorithm for Markov decision processes
    Bhatnagar, S.
    Kumar, S.
    2005 International Conference on Intelligent Sensing and Information Processing, Proceedings, 2005: 199-204
  • [8] RVI Reinforcement Learning for Semi-Markov Decision Processes with Average Reward
    Li, Yanjie
    Cao, Fang
    2010 8th World Congress on Intelligent Control and Automation (WCICA), 2010: 1674-1679
  • [9] On some algorithms for limiting average Markov decision processes
    Daoui, C.
    Abbad, M.
    Operations Research Letters, 2007, 35(2): 261-266