An adaptive sampling algorithm for solving Markov decision processes

被引:69
作者
Chang, HS [1 ]
Fu, MC
Hu, JQ
Marcus, SI
机构
[1] Sogang Univ, Dept Comp Sci & Engn, Seoul 121742, South Korea
[2] Univ Maryland, Robert H Smith Sch Business, College Pk, MD 20742 USA
[3] Univ Maryland, Syst Res Inst, College Pk, MD 20742 USA
[4] Univ Maryland, Dept Elect & Comp Engn, College Pk, MD 20742 USA
关键词
D O I
10.1287/opre.1040.0145
中图分类号
C93 [管理学];
学科分类号
12 ; 1201 ; 1202 ; 120202 ;
摘要
Based on recent results for multiarmed bandit problems, we propose an adaptive sampling algorithm that approximates the optimal value of a finite-horizon Markov decision process (MDP) with finite state and action spaces. The algorithm adaptively chooses which action to sample as the sampling process proceeds and generates an asymptotically unbiased estimator, whose bias is bounded by a quantity that converges to zero at rate (lnN)/N, where N is the total number of samples that are used per state sampled in each stage. The worst-case running-time complexity of the algorithm is O((vertical bar A vertical bar N)(H)), independent of the size of the state space, where vertical bar A vertical bar is the size of the action space and H is the horizon length. The algorithm can be used to create an approximate receding horizon control to solve infinite-horizon MDPs. To illustrate the algorithm, computational results are reported on simple examples from inventory control.
引用
收藏
页码:126 / 139
页数:14
相关论文
共 12 条
[1]   ASYMPTOTICALLY EFFICIENT ADAPTIVE ALLOCATION SCHEMES FOR CONTROLLED MARKOV-CHAINS - FINITE PARAMETER SPACE [J].
AGRAWAL, R ;
TENEKETZIS, D ;
ANANTHARAM, V .
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1989, 34 (12) :1249-1259
[3]   Finite-time analysis of the multiarmed bandit problem [J].
Auer, P ;
Cesa-Bianchi, N ;
Fischer, P .
MACHINE LEARNING, 2002, 47 (2-3) :235-256
[4]   A survey of computational complexity results in systems and control [J].
Blondel, VD ;
Tsitsiklis, JN .
AUTOMATICA, 2000, 36 (09) :1249-1274
[5]   Pricing American-style securities using simulation [J].
Broadie, M ;
Glasserman, P .
JOURNAL OF ECONOMIC DYNAMICS & CONTROL, 1997, 21 (8-9) :1323-1352
[6]   Asymptotically efficient adaptive choice of control laws in controlled Markov chains [J].
Graves, TL ;
Lai, TL .
SIAM JOURNAL ON CONTROL AND OPTIMIZATION, 1997, 35 (03) :715-743
[7]   ERROR-BOUNDS FOR ROLLING HORIZON POLICIES IN DISCRETE-TIME MARKOV CONTROL PROCESSES [J].
HERNANDEZLERMA, O ;
LASSERRE, JB .
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1990, 35 (10) :1118-1124
[9]   ASYMPTOTICALLY EFFICIENT ADAPTIVE ALLOCATION RULES [J].
LAI, TL ;
ROBBINS, H .
ADVANCES IN APPLIED MATHEMATICS, 1985, 6 (01) :4-22
[10]  
[No title captured]