Generalized Second-Order Value Iteration in Markov Decision Processes

被引：5

作者：

Kamanchi, Chandramouli ^{[1
]}

Diddigi, Raghuram Bharadwaj ^{[1
]}

Bhatnagar, Shalabh ^{[1
]}

机构：

[1] Indian Inst Sci, Dept Comp Sci & Automat, Bengaluru 560012, India

来源：

IEEE TRANSACTIONS ON AUTOMATIC CONTROL | 2022年 / 67卷 / 08期

关键词：

Mathematical model; Convergence; Newton method; Standards; Approximation algorithms; Markov processes; Entropy; Markov decision process; value iteration; successive relaxation;

D O I：

10.1109/TAC.2021.3112851

中图分类号：

TP [自动化技术、计算机技术];

学科分类号：

0812 ;

摘要：

Value iteration is a fixed point iteration technique utilized to obtain the optimal value function and policy in a discounted reward Markov decision process (MDP). Here, a contraction operator is constructed and applied repeatedly to arrive at the optimal solution. Value iteration is a first-order method and, therefore, it may take a large number of iterations to converge to the optimal solution. Successive relaxation is a popular technique that can be applied to solve a fixed point equation. It has been shown in the literature that under a special structure of the MDP, successive overrelaxation technique computes the optimal value function faster than standard value iteration. In this article, we propose a second-order value iteration procedure that is obtained by applying the Newton-Raphson method to the successive relaxation value iteration scheme. We prove the global convergence of our algorithm to the optimal solution asymptotically and show the second-order convergence. Through experiments, we demonstrate the effectiveness of our proposed approach.

引用

页码：4241 / 4247

页数：7

共 19 条

[1]

[Anonymous], 2019, GITHUB PYTHON MDPTOO

[2] DYNAMIC PROGRAMMING [J].

BELLMAN, R .

SCIENCE, 1966, 153 (3731) :34-&

[3]

Bertsekas D.P., 1996, NEURO DYNAMIC PROGRA, V3

[4]

Borkar V.S., 2008, STOCHASTIC APPROXIMA

[5]

Dai B, 2018, PR MACH LEARN RES, V80

[6]

Devraj A. M., 2017, Advances in Neural Information Processing Systems (NeurIPS), P2235

[7]

Furmston T, 2016, J MACH LEARN RES, V17

[8]

Goyal V, 2021, Arxiv, DOI arXiv:1905.09963

[9]

Haarnoja T, 2017, PR MACH LEARN RES, V70

[10]

Haarnoja T, 2018, PR MACH LEARN RES, V80

← 1 2 →