Relative Q-Learning for Average-Reward Markov Decision Processes With Continuous States

Cited by: 1
Authors
Yang, Xiangyu [1 ]
Hu, Jiaqiao [2 ]
Hu, Jian-Qiang [3 ]
Affiliations
[1] Shandong Univ, Sch Management, Jinan 250100, Peoples R China
[2] SUNY Stony Brook, Dept Appl Math & Stat, Stony Brook, NY 11794 USA
[3] Fudan Univ, Sch Management, Shanghai 200433, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China; US National Science Foundation;
Keywords
Q-learning; approximation algorithms; mathematical models; Markov decision processes; trajectory; prediction algorithms; optimization; dynamic systems and control; Markov processes; online computation
DOI
10.1109/TAC.2024.3371380
Chinese Library Classification
TP [automation technology, computer technology]
Discipline Code
0812
Abstract
Markov decision processes (MDPs) are widely used for modeling sequential decision-making problems under uncertainty. We propose an online algorithm for solving a class of average-reward MDPs with continuous state spaces in a model-free setting. The algorithm combines the classical relative Q-learning with an asynchronous averaging procedure, which permits the Q-value estimate at a state-action pair to be updated based on observations at other neighboring pairs sampled in subsequent iterations. These point estimates are then retained and used for constructing an interpolation-based function approximator that predicts the Q-function values at unexplored state-action pairs. We show that with probability one the sequence of function approximators converges to the optimal Q-function up to a constant. Numerical results on a simple benchmark example are reported to illustrate the algorithm.
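To make the ideas in the abstract concrete, below is a minimal illustrative sketch of relative Q-learning with interpolation over retained point estimates. It is not the authors' algorithm: the 1-D MDP, the 0.05 neighborhood radius, the step-size schedule, and nearest-neighbor prediction (standing in for the paper's interpolation-based function approximator) are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy continuous-state MDP (illustrative only): state s in [0, 1],
# actions {0, 1} drift the state left/right with noise, and the
# reward penalizes distance from the center.
ACTIONS = [0, 1]

def step(s, a):
    drift = 0.1 if a == 1 else -0.1
    s_next = float(np.clip(s + drift + 0.05 * rng.standard_normal(), 0.0, 1.0))
    return s_next, -abs(s_next - 0.5)

# Retained state-action sample points and their Q-value estimates.
points = []   # list of (state, action)
q_vals = []   # matching Q estimates

def q_interp(s, a):
    """Nearest-neighbor prediction of Q at a possibly unexplored pair (s, a)."""
    idx = [i for i, (_, ai) in enumerate(points) if ai == a]
    if not idx:
        return 0.0
    j = min(idx, key=lambda i: abs(points[i][0] - s))
    return q_vals[j]

s = rng.uniform()
s_ref, a_ref = 0.5, 0            # fixed reference pair for the relative offset
for k in range(1, 5001):
    a = int(rng.choice(ACTIONS))  # pure exploration, for simplicity
    s_next, r = step(s, a)
    alpha = 1.0 / (1.0 + k) ** 0.6
    # Relative Q-learning target: subtracting Q at a reference pair keeps
    # the average-reward Q estimates bounded (they matter only up to a constant).
    target = r + max(q_interp(s_next, b) for b in ACTIONS) - q_interp(s_ref, a_ref)
    # Asynchronous averaging, loosely: an observation updates the nearest
    # retained point within a neighborhood; otherwise a new point is retained.
    idx = [i for i, (_, ai) in enumerate(points) if ai == a]
    near = [i for i in idx if abs(points[i][0] - s) < 0.05]
    if near:
        j = min(near, key=lambda i: abs(points[i][0] - s))
        q_vals[j] += alpha * (target - q_vals[j])
    else:
        points.append((s, a))
        q_vals.append(target)
    s = s_next

# Query the approximator at an arbitrary unexplored state.
print(q_interp(0.2, 0), q_interp(0.2, 1))
```

The reference-pair subtraction is what distinguishes relative Q-learning from the discounted variant: without it, average-reward Q estimates drift by an additive constant per update rather than converging.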
Pages: 6546-6560 (15 pages)