Bias-Corrected Q-Learning With Multistate Extension

Cited by: 14
Authors
Lee, Donghun [1 ]
Powell, Warren B. [2 ]
Affiliations
[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08540 USA
[2] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08540 USA
Keywords
Bias correction; electricity storage; Q-learning; smart grid; stochastic approximation; convergence; rates
DOI
10.1109/TAC.2019.2912443
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
Q-learning is a sample-based, model-free algorithm that solves Markov decision problems asymptotically, but in finite time it can perform poorly when random rewards and transitions produce large variance in the value estimates. We pinpoint the cause as the estimation bias introduced by the maximum operator in the Q-learning update, and present evidence of this max-operator bias in the Q-value estimates. We then present an asymptotically optimal bias-correction strategy and extend the bias-corrected Q-learning algorithm to multistate Markov decision processes, with asymptotic convergence properties as strong as those of Q-learning. We report the empirical performance of the bias-corrected Q-learning algorithm with multistate extension on two model problems: a multiarmed bandit version of Roulette and an electricity storage control simulation. The bias-corrected Q-learning algorithm with multistate extension is shown to control max-operator bias effectively, and its bias resistance can be tuned predictably by adjusting a correction parameter.
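For reference, the minimal sketch below shows a standard tabular Q-learning update and marks the max operator where the overestimation bias discussed in the abstract arises. It is an illustrative sketch under our own assumptions (the names Q, alpha, and gamma are ours), not the paper's bias-corrected algorithm or its correction parameter.

import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Standard Q-learning target: r + gamma * max_a' Q(s', a').
    # Because Q[s_next] holds noisy estimates, the max of noisy estimates
    # tends to overestimate the max of the true values; this is the
    # max-operator bias that the paper's correction strategy addresses.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q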
Pages: 4011-4023
Page count: 13