Simulation-based optimization of Markov decision processes: An empirical process theory approach

Cited by: 8
Authors
Jain, Rahul [1 ,2 ]
Varaiya, Pravin [3 ]
Affiliations
[1] Univ So Calif, Dept Elect Engn, Los Angeles, CA 90089 USA
[2] Univ So Calif, ISE Dept, Los Angeles, CA 90089 USA
[3] Univ Calif Berkeley, Dept EECS, Berkeley, CA 94720 USA
Funding
US National Science Foundation;
Keywords
Markov decision processes; Learning algorithms; Monte Carlo simulation; Stochastic control; Optimization; UNIFORM-CONVERGENCE;
DOI
10.1016/j.automatica.2010.05.021
Chinese Library Classification (CLC)
TP [Automation technology; computer technology];
Discipline code
0812;
Abstract
We generalize and build on the PAC learning framework for Markov decision processes developed in Jain and Varaiya (2006). We consider reward functions that depend on both the state and the action, and allow both the state and action spaces to be countably infinite. We obtain an estimate for the value function of a Markov decision process, which assigns to each policy its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the empirical average to converge to the expected reward uniformly over a class of policies, in terms of the VC or pseudo-dimension of the policy class. We then propose a framework for obtaining an epsilon-optimal policy from simulation, and provide the sample complexity of this approach. (C) 2010 Elsevier Ltd. All rights reserved.
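The basic estimator the abstract describes — approximating a policy's expected discounted reward by the empirical average over independent simulation runs — can be sketched as follows. This is an illustrative sketch, not code from the paper; the function names, the truncation horizon, and the toy dynamics are assumptions made for the example.

```python
import random

def estimate_value(policy, step, initial_state, gamma=0.95,
                   horizon=100, runs=2000, seed=0):
    """Monte Carlo estimate of a policy's expected discounted reward.

    policy(s) -> action; step(s, a, rng) -> (next_state, reward).
    The infinite-horizon discounted sum is truncated at `horizon`;
    with rewards bounded by r_max this truncation contributes a bias
    of at most gamma**horizon * r_max / (1 - gamma).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        s, discounted = initial_state, 0.0
        for t in range(horizon):
            a = policy(s)
            s, r = step(s, a, rng)
            discounted += (gamma ** t) * r
        total += discounted
    return total / runs  # empirical average over independent runs

# Hypothetical two-state chain: action 0 stays, action 1 flips the
# state; the reward is the next state, with a 10% chance of noise.
def toy_step(s, a, rng):
    nxt = s if a == 0 else 1 - s
    if rng.random() < 0.1:
        nxt = 1 - nxt
    return nxt, float(nxt)

v = estimate_value(lambda s: 1 - s, toy_step, 0, gamma=0.9,
                   horizon=60, runs=500)
```

The paper's contribution is not this estimator itself but the number of runs it needs: uniform-convergence bounds over a whole policy class, governed by the class's VC or pseudo-dimension rather than by a union bound over individual policies.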
Pages: 1297-1304
Page count: 8
References
25 items in total
[1]   Scale-sensitive dimensions, uniform convergence, and learnability [J].
Alon, N ;
Ben-David, S ;
Cesa-Bianchi, N ;
Haussler, D .
JOURNAL OF THE ACM, 1997, 44 (04) :615-631
[2]   RATE OF CONVERGENCE OF EMPIRICAL MEASURES AND COSTS IN CONTROLLED MARKOV-CHAINS AND TRANSIENT OPTIMALITY [J].
ALTMAN, E ;
ZEITOUNI, O .
MATHEMATICS OF OPERATIONS RESEARCH, 1994, 19 (04) :955-974
[3]  
Anthony, M., 1999, Neural Network Learning: Theoretical Foundations
[4]  
BARTLETT PL, 2007, ADV NEURAL INFORM PR, V19
[5]   Infinite-horizon policy-gradient estimation [J].
Baxter, J ;
Bartlett, PL .
JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2001, 15 :319-350
[6]  
BERTSIMAS D, 1996, MATH OPER RES, V21
[7]  
Cao, Xi-Ren, 2007, STOCHASTIC LEARNING
[8]  
Chang HS, 2007, COMMUN CONTROL ENG, P1, DOI 10.1007/978-1-84628-690-2
[9]   DYNAMIC PROGRAMMING CONDITIONS FOR PARTIALLY OBSERVABLE STOCHASTIC SYSTEMS [J].
DAVIS, MHA ;
VARAIYA, P .
SIAM JOURNAL ON CONTROL, 1973, 11 (02) :226-261
[10]  
Dudley RM., 1999, UNIFORM CENTRAL LIMI