Relative Q-Learning for Average-Reward Markov Decision Processes With Continuous States

Cited by: 1
Authors
Yang, Xiangyu [1 ]
Hu, Jiaqiao [2 ]
Hu, Jian-Qiang [3 ]
Affiliations
[1] Shandong Univ, Sch Management, Jinan 250100, Peoples R China
[2] SUNY Stony Brook, Dept Appl Math & Stat, Stony Brook, NY 11794 USA
[3] Fudan Univ, Sch Management, Shanghai 200433, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China; US National Science Foundation;
Keywords
Q-learning; approximation algorithms; mathematical models; Markov decision processes; trajectory; prediction algorithms; optimization; dynamic systems and control; Markov processes; online computation
DOI
10.1109/TAC.2024.3371380
Chinese Library Classification
TP [automation technology, computer technology]
Discipline Code
0812
Abstract
Markov decision processes (MDPs) are widely used for modeling sequential decision-making problems under uncertainty. We propose an online algorithm for solving a class of average-reward MDPs with continuous state spaces in a model-free setting. The algorithm combines the classical relative Q-learning with an asynchronous averaging procedure, which permits the Q-value estimate at a state-action pair to be updated based on observations at other neighboring pairs sampled in subsequent iterations. These point estimates are then retained and used for constructing an interpolation-based function approximator that predicts the Q-function values at unexplored state-action pairs. We show that with probability one the sequence of function approximators converges to the optimal Q-function up to a constant. Numerical results on a simple benchmark example are reported to illustrate the algorithm.
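To make the ideas in the abstract concrete, below is a minimal illustrative sketch of relative Q-learning with interpolation over retained point estimates. It is not the authors' algorithm: the 1-D MDP, the 0.05 neighborhood radius, the step-size schedule, and nearest-neighbor prediction (standing in for the paper's interpolation-based function approximator) are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy continuous-state MDP (illustrative only): state s in [0, 1],
# actions {0, 1} drift the state left/right with noise, and the
# reward penalizes distance from the center.
ACTIONS = [0, 1]

def step(s, a):
    drift = 0.1 if a == 1 else -0.1
    s_next = float(np.clip(s + drift + 0.05 * rng.standard_normal(), 0.0, 1.0))
    return s_next, -abs(s_next - 0.5)

# Retained state-action sample points and their Q-value estimates.
points = []   # list of (state, action)
q_vals = []   # matching Q estimates

def q_interp(s, a):
    """Nearest-neighbor prediction of Q at a possibly unexplored pair (s, a)."""
    idx = [i for i, (_, ai) in enumerate(points) if ai == a]
    if not idx:
        return 0.0
    j = min(idx, key=lambda i: abs(points[i][0] - s))
    return q_vals[j]

s = rng.uniform()
s_ref, a_ref = 0.5, 0            # fixed reference pair for the relative offset
for k in range(1, 5001):
    a = int(rng.choice(ACTIONS))  # pure exploration, for simplicity
    s_next, r = step(s, a)
    alpha = 1.0 / (1.0 + k) ** 0.6
    # Relative Q-learning target: subtracting Q at a reference pair keeps
    # the average-reward Q estimates bounded (they matter only up to a constant).
    target = r + max(q_interp(s_next, b) for b in ACTIONS) - q_interp(s_ref, a_ref)
    # Asynchronous averaging, loosely: an observation updates the nearest
    # retained point within a neighborhood; otherwise a new point is retained.
    idx = [i for i, (_, ai) in enumerate(points) if ai == a]
    near = [i for i in idx if abs(points[i][0] - s) < 0.05]
    if near:
        j = min(near, key=lambda i: abs(points[i][0] - s))
        q_vals[j] += alpha * (target - q_vals[j])
    else:
        points.append((s, a))
        q_vals.append(target)
    s = s_next

# Query the approximator at an arbitrary unexplored state.
print(q_interp(0.2, 0), q_interp(0.2, 1))
```

The reference-pair subtraction is what distinguishes relative Q-learning from the discounted variant: without it, average-reward Q estimates drift by an additive constant per update rather than converging.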
Pages: 6546-6560 (15 pages)