Relative Q-Learning for Average-Reward Markov Decision Processes With Continuous States

Cited by: 1
Authors
Yang, Xiangyu [1 ]
Hu, Jiaqiao [2 ]
Hu, Jian-Qiang [3 ]
Affiliations
[1] Shandong Univ, Sch Management, Jinan 250100, Peoples R China
[2] SUNY Stony Brook, Dept Appl Math & Stat, Stony Brook, NY 11794 USA
[3] Fudan Univ, Sch Management, Shanghai 200433, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China; U.S. National Science Foundation
Keywords
Q-learning; Approximation algorithms; Mathematical models; Markov decision processes; Trajectory; Prediction algorithms; Optimization; Dynamic systems and control; Markov processes; online computation; ALGORITHMS;
DOI
10.1109/TAC.2024.3371380
Chinese Library Classification (CLC)
TP [Automation technology; Computer technology];
Subject classification code
0812;
Abstract
Markov decision processes (MDPs) are widely used for modeling sequential decision-making problems under uncertainty. We propose an online algorithm for solving a class of average-reward MDPs with continuous state spaces in a model-free setting. The algorithm combines the classical relative Q-learning with an asynchronous averaging procedure, which permits the Q-value estimate at a state-action pair to be updated based on observations at other neighboring pairs sampled in subsequent iterations. These point estimates are then retained and used for constructing an interpolation-based function approximator that predicts the Q-function values at unexplored state-action pairs. We show that with probability one the sequence of function approximators converges to the optimal Q-function up to a constant. Numerical results on a simple benchmark example are reported to illustrate the algorithm.
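The abstract only outlines the method. For orientation, the minimal Python sketch below illustrates the general flavor of relative (RVI) Q-learning for average-reward problems under simplifying assumptions: a finite action set, a fixed reference state supplying the offset term f(Q), and a plain nearest-neighbor interpolator standing in for the paper's asynchronous averaging and interpolation-based approximator. All names (RelativeQInterpolator, step_size, ref_state, and so on) are hypothetical; this is not the authors' algorithm.

    # Minimal illustrative sketch, NOT the authors' algorithm: relative (RVI)
    # Q-learning with a finite action set and a nearest-neighbor interpolator
    # standing in for the paper's asynchronous averaging / interpolation scheme.
    import numpy as np


    class RelativeQInterpolator:
        """Keeps Q-value point estimates at visited states and predicts Q at
        unexplored states by nearest-neighbor interpolation."""

        def __init__(self, n_actions, step_size=0.1):
            self.n_actions = n_actions
            self.step_size = step_size
            self.states = []   # visited states, each a 1-D numpy array
            self.q = []        # one Q-vector (length n_actions) per visited state

        def _nearest(self, state):
            dists = [np.linalg.norm(s - state) for s in self.states]
            return int(np.argmin(dists))

        def predict(self, state):
            """Q-value vector at an arbitrary (possibly unexplored) state."""
            if not self.states:
                return np.zeros(self.n_actions)
            return self.q[self._nearest(state)]

        def update(self, state, action, reward, next_state, ref_state):
            """One relative Q-learning step,
                Q(s,a) <- Q(s,a) + alpha * (r + max_a' Q(s',a') - f(Q) - Q(s,a)),
            with f(Q) = max_a Q(s_ref, a) pinning down the additive constant."""
            state = np.asarray(state, dtype=float)
            q_next = np.max(self.predict(next_state))
            f_q = np.max(self.predict(ref_state))      # relative / offset term
            # Retain a point estimate at the visited state if it is new,
            # initialized from its nearest stored neighbor.
            if not self.states or not np.array_equal(self.states[self._nearest(state)], state):
                init = self.predict(state).copy()
                self.states.append(state)
                self.q.append(init)
            i = self._nearest(state)
            td_error = reward + q_next - f_q - self.q[i][action]
            self.q[i][action] += self.step_size * td_error


    if __name__ == "__main__":
        # Toy usage: 1-D state on [0, 1], two actions that nudge the state
        # left/right, reward highest near 0.5, so the long-run average reward
        # is maximized by staying near the middle.
        rng = np.random.default_rng(0)
        agent = RelativeQInterpolator(n_actions=2, step_size=0.1)
        ref_state = np.array([0.5])                    # fixed reference state for f(Q)
        state = np.array([rng.uniform()])
        for _ in range(1000):
            greedy = int(np.argmax(agent.predict(state)))
            action = greedy if rng.uniform() > 0.2 else int(rng.integers(2))
            step = 0.1 if action == 1 else -0.1
            next_state = np.clip(state + step + 0.02 * rng.normal(), 0.0, 1.0)
            reward = -abs(float(next_state[0]) - 0.5)
            agent.update(state, action, reward, next_state, ref_state)
            state = next_state
        print("estimated Q at s=0.5:", agent.predict(np.array([0.5])))

The subtracted offset f(Q) is what makes the iteration "relative": it keeps the Q-estimates bounded under the average-reward criterion, which is also why convergence can only be to the optimal Q-function up to an additive constant, as stated in the abstract.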
Pages: 6546-6560
Number of pages: 15