Navigating to the Best Policy in Markov Decision Processes

Cited by: 0
Authors
Al Marjani, Aymen [1]
Garivier, Aurelien [2]
Proutiere, Alexandre [3]
Affiliations
[1] ENS Lyon, UMPA, Lyon, France
[2] ENS Lyon, INRIA, CNRS, UMPA, Lyon, France
[3] KTH Royal Institute of Technology, EECS & Digital Futures, Stockholm, Sweden
Source
Advances in Neural Information Processing Systems 34 (NeurIPS 2021) | 2021 / Vol. 34
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We investigate the classical active pure exploration problem in Markov Decision Processes, where the agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible. We propose a problem-dependent lower bound on the average number of steps required before a correct answer can be given with probability at least 1 - δ. We further provide the first algorithm with an instance-specific sample complexity in this setting. This algorithm addresses the general case of communicating MDPs; we also propose a variant with a reduced exploration rate (and hence faster convergence) under an additional ergodicity assumption. This work extends previous results on the generative setting [MP21], in which the agent could at each step query the random outcome of any (state, action) pair. In contrast, we show here how to deal with the navigation constraints induced by the online setting. Our analysis relies on an ergodic theorem for non-homogeneous Markov chains, which we consider to be of wide interest in the analysis of Markov Decision Processes.
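For context, the problem-dependent lower bound announced in the abstract typically takes the following form in this line of work; the display below is a sketch reconstructed from the companion generative-setting result [MP21] and standard change-of-measure arguments, with T*(M), Ω(M) and Alt(M) as illustrative notation rather than a verbatim statement from the paper. For any δ-PAC algorithm with stopping time τ_δ on an MDP M,

\[
\mathbb{E}_{\mathcal{M}}[\tau_\delta] \;\ge\; T^*(\mathcal{M})\,\mathrm{kl}(\delta,\,1-\delta),
\qquad
\frac{1}{T^*(\mathcal{M})} \;=\; \sup_{\omega \in \Omega(\mathcal{M})} \;\inf_{\mathcal{M}' \in \mathrm{Alt}(\mathcal{M})} \;\sum_{s,a} \omega(s,a)\,\mathrm{KL}\big(p_{\mathcal{M}}(\cdot \mid s,a)\,\big\|\,p_{\mathcal{M}'}(\cdot \mid s,a)\big),
\]

where \(\mathrm{kl}(\delta, 1-\delta)\) is the binary relative entropy, \(\mathrm{Alt}(\mathcal{M})\) denotes the set of MDPs whose optimal policy differs from that of \(\mathcal{M}\), and \(\Omega(\mathcal{M})\) is the set of stationary state-action distributions satisfying the navigation constraints \(\sum_a \omega(s,a) = \sum_{s',a'} p(s \mid s',a')\,\omega(s',a')\) for every state s. These flow constraints, absent in the generative setting where ω may be chosen freely over (state, action) pairs, are exactly what the online agent must respect while exploring.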
Pages: 13
References (42 total)
[1] Agarwal A. (2020). Conference on Learning Theory, p. 67.
[2] Agarwal A. (2020). Advances in Neural Information Processing Systems, 33, p. 13399.
[3] Al Marjani A. (2021). PMLR, p. 7459.
[4] Auer P. (2009). Advances in Neural Information Processing Systems, 21.
[5] Azar M. G., Munos R., Kappen H. J. (2013). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3), 325-349.
[6] Burnetas A. N., Katehakis M. N. (1997). Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1), 222-255.
[7] Cover T. M. (1991). Elements of Information Theory. DOI: 10.1002/0471200611.
[8] Degenne R. (2019). NeurIPS.
[9] Dietterich T. (2013). PAC Optimal Planning.
[10] Domingues O. D. (2021). Algorithmic Learning Theory (ALT).