Relative Q-Learning for Average-Reward Markov Decision Processes With Continuous States

Cited by: 1
Authors
Yang, Xiangyu [1 ]
Hu, Jiaqiao [2 ]
Hu, Jian-Qiang [3 ]
Affiliations
[1] Shandong Univ, Sch Management, Jinan 250100, Peoples R China
[2] SUNY Stony Brook, Dept Appl Math & Stat, Stony Brook, NY 11794 USA
[3] Fudan Univ, Sch Management, Shanghai 200433, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China; U.S. National Science Foundation
Keywords
Q-learning; Approximation algorithms; Mathematical models; Markov decision processes; Trajectory; Prediction algorithms; Optimization; Dynamic systems and control; Markov processes; online computation; ALGORITHMS;
DOI
10.1109/TAC.2024.3371380
Chinese Library Classification (CLC)
TP [Automation technology; Computer technology];
Subject classification code
0812;
Abstract
Markov decision processes (MDPs) are widely used for modeling sequential decision-making problems under uncertainty. We propose an online algorithm for solving a class of average-reward MDPs with continuous state spaces in a model-free setting. The algorithm combines the classical relative Q-learning with an asynchronous averaging procedure, which permits the Q-value estimate at a state-action pair to be updated based on observations at other neighboring pairs sampled in subsequent iterations. These point estimates are then retained and used for constructing an interpolation-based function approximator that predicts the Q-function values at unexplored state-action pairs. We show that with probability one the sequence of function approximators converges to the optimal Q-function up to a constant. Numerical results on a simple benchmark example are reported to illustrate the algorithm.
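The abstract only outlines the method. For orientation, the minimal Python sketch below illustrates the general flavor of relative (RVI) Q-learning for average-reward problems under simplifying assumptions: a finite action set, a fixed reference state supplying the offset term f(Q), and a plain nearest-neighbor interpolator standing in for the paper's asynchronous averaging and interpolation-based approximator. All names (RelativeQInterpolator, step_size, ref_state, and so on) are hypothetical; this is not the authors' algorithm.

    # Minimal illustrative sketch, NOT the authors' algorithm: relative (RVI)
    # Q-learning with a finite action set and a nearest-neighbor interpolator
    # standing in for the paper's asynchronous averaging / interpolation scheme.
    import numpy as np


    class RelativeQInterpolator:
        """Keeps Q-value point estimates at visited states and predicts Q at
        unexplored states by nearest-neighbor interpolation."""

        def __init__(self, n_actions, step_size=0.1):
            self.n_actions = n_actions
            self.step_size = step_size
            self.states = []   # visited states, each a 1-D numpy array
            self.q = []        # one Q-vector (length n_actions) per visited state

        def _nearest(self, state):
            dists = [np.linalg.norm(s - state) for s in self.states]
            return int(np.argmin(dists))

        def predict(self, state):
            """Q-value vector at an arbitrary (possibly unexplored) state."""
            if not self.states:
                return np.zeros(self.n_actions)
            return self.q[self._nearest(state)]

        def update(self, state, action, reward, next_state, ref_state):
            """One relative Q-learning step,
                Q(s,a) <- Q(s,a) + alpha * (r + max_a' Q(s',a') - f(Q) - Q(s,a)),
            with f(Q) = max_a Q(s_ref, a) pinning down the additive constant."""
            state = np.asarray(state, dtype=float)
            q_next = np.max(self.predict(next_state))
            f_q = np.max(self.predict(ref_state))      # relative / offset term
            # Retain a point estimate at the visited state if it is new,
            # initialized from its nearest stored neighbor.
            if not self.states or not np.array_equal(self.states[self._nearest(state)], state):
                init = self.predict(state).copy()
                self.states.append(state)
                self.q.append(init)
            i = self._nearest(state)
            td_error = reward + q_next - f_q - self.q[i][action]
            self.q[i][action] += self.step_size * td_error


    if __name__ == "__main__":
        # Toy usage: 1-D state on [0, 1], two actions that nudge the state
        # left/right, reward highest near 0.5, so the long-run average reward
        # is maximized by staying near the middle.
        rng = np.random.default_rng(0)
        agent = RelativeQInterpolator(n_actions=2, step_size=0.1)
        ref_state = np.array([0.5])                    # fixed reference state for f(Q)
        state = np.array([rng.uniform()])
        for _ in range(1000):
            greedy = int(np.argmax(agent.predict(state)))
            action = greedy if rng.uniform() > 0.2 else int(rng.integers(2))
            step = 0.1 if action == 1 else -0.1
            next_state = np.clip(state + step + 0.02 * rng.normal(), 0.0, 1.0)
            reward = -abs(float(next_state[0]) - 0.5)
            agent.update(state, action, reward, next_state, ref_state)
            state = next_state
        print("estimated Q at s=0.5:", agent.predict(np.array([0.5])))

The subtracted offset f(Q) is what makes the iteration "relative": it keeps the Q-estimates bounded under the average-reward criterion, which is also why convergence can only be to the optimal Q-function up to an additive constant, as stated in the abstract.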
Pages: 6546-6560
Number of pages: 15