A sparse sampling algorithm for near-optimal planning in large Markov decision processes

Cited by: 255
Authors
Kearns, M
Mansour, Y
Ng, AY
Affiliations
[1] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA
[2] Tel Aviv Univ, Dept Comp Sci, IL-69978 Tel Aviv, Israel
[3] Univ Calif Berkeley, Dept Comp Sci, Berkeley, CA 94704 USA
Keywords
reinforcement learning; Markov decision processes; planning;
DOI
10.1023/A:1017932429737
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A critical issue for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochastic environments with very large or infinite state spaces, traditional planning and reinforcement learning algorithms may be inapplicable, since their running time typically grows linearly with the state space size in the worst case. In this paper we present a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states. The running time is exponential in the horizon time (which depends only on the discount factor gamma and the desired degree of approximation to the optimal policy). Our algorithm thus provides a different complexity trade-off than classical algorithms such as value iteration: rather than scaling linearly in both horizon time and state space size, our running time trades an exponential dependence on the former for no dependence on the latter. Our algorithm is based on the idea of sparse sampling. We prove that a randomly sampled look-ahead tree that covers only a vanishing fraction of the full look-ahead tree nevertheless suffices to compute near-optimal actions from any state of an MDP. Practical implementations of the algorithm are discussed, and we draw ties to our related recent results on finding a near-best strategy from a given class of strategies in very large partially observable MDPs (Kearns, Mansour, & Ng, Neural Information Processing Systems 13, to appear).
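The per-state planner described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy generative model, and the parameter choices (tree depth, sampling width) are invented for the example. The key idea is that each node of the look-ahead tree queries the generative model only a fixed number of times per action, so the cost per decision is independent of the number of states.

```python
def estimate_value(model, actions, state, depth, width, gamma):
    """Sparse-sampling estimate of V*(state) from a depth-limited tree.

    `model(state, action)` is a generative model returning a sampled
    (next_state, reward) pair; leaves of the tree are valued at 0.
    """
    if depth == 0:
        return 0.0
    return max(estimate_q(model, actions, state, a, depth, width, gamma)
               for a in actions)


def estimate_q(model, actions, state, action, depth, width, gamma):
    # Average over `width` sampled successors rather than enumerating
    # the full (possibly infinite) successor distribution.
    total = 0.0
    for _ in range(width):
        next_state, reward = model(state, action)
        total += reward + gamma * estimate_value(
            model, actions, next_state, depth - 1, width, gamma)
    return total / width


def plan_action(model, actions, state, depth, width, gamma):
    # On-line planner: from the current state, return the action with
    # the highest sampled Q estimate.
    return max(actions, key=lambda a: estimate_q(
        model, actions, state, a, depth, width, gamma))


# Toy deterministic two-state MDP (illustrative only): action "move"
# toggles the state, "stay" keeps it; reward 1 for landing in state 1.
def toy_model(s, a):
    nxt = 1 - s if a == "move" else s
    return nxt, float(nxt == 1)
```

Note that the total work per decision is on the order of (|A| * width)^depth generative-model calls, exponential in the horizon but with no dependence on the size of the state space, which is exactly the trade-off the abstract describes.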
Pages: 193-208
Page count: 16
References
18 records in total
[1]  
Aho A.V., 1974, The Design and Analysis of Computer Algorithms
Barto A.G., Bradtke S.J., Singh S.P., 1995, Learning to act using real-time dynamic programming, Artificial Intelligence, 72(1-2):81-138
[3]  
Bonet B., 1997, Proc. 14th National Conference on Artificial Intelligence
[4]  
Boutilier C., 1995, Proc. 14th International Joint Conference on Artificial Intelligence, p. 1104
[5]  
Boyen X., 1998, Proc. 1998 Conference on Uncertainty in Artificial Intelligence
[6]  
Davies S., 1998, Proc. Fifteenth National Conference on Artificial Intelligence (AAAI-98) and Tenth Conference on Innovative Applications of Artificial Intelligence (IAAI-98), p. 753
[7]  
Dearden R., 1994, Proc. 10th Annual Conference on Uncertainty in Artificial Intelligence
[8]  
Kearns M., 1999, Neural Information Processing Systems, Vol. 12
[9]  
Kearns M., in press, Neural Information Processing Systems, Vol. 13
[10]  
Koenig S., 1998, Proc. 4th International Conference on Artificial Intelligence Planning Systems