Task-agnostic Exploration in Reinforcement Learning

Times Cited: 0
Authors
Zhang, Xuezhou [1 ]
Ma, Yuzhe [1 ]
Singla, Adish [2 ]
Affiliations
[1] UW Madison, Madison, WI 53706 USA
[2] MPI SWS, Saarbrucken, Germany
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020 | 2020 / Vol. 33
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Efficient exploration is one of the main challenges in reinforcement learning (RL). Most existing sample-efficient algorithms assume the existence of a single reward function during exploration. In many practical scenarios, however, there is not a single underlying reward function to guide the exploration, for instance, when an agent needs to learn many skills simultaneously, or multiple conflicting objectives need to be balanced. To address these challenges, we propose the task-agnostic RL framework: In the exploration phase, the agent first collects trajectories by exploring the MDP without the guidance of a reward function. After exploration, it aims at finding near-optimal policies for N tasks, given the collected trajectories augmented with sampled rewards for each task. We present an efficient task-agnostic RL algorithm, UCBZERO, that finds ε-optimal policies for N arbitrary tasks after at most Õ(log(N) H^5 SA/ε^2) exploration episodes, where H is the episode length, S is the state space size, and A is the action space size. We also provide an Ω(log(N) H^2 SA/ε^2) lower bound, showing that the log dependency on N is unavoidable. Furthermore, we provide an N-independent sample complexity bound of UCBZERO in the recently proposed reward-free setting, a statistically easier setting where the ground truth reward functions are known.
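To make the two-phase framework concrete, the following is a minimal tabular sketch of a UCBZERO-style exploration phase: optimistic Q-learning driven purely by a count-based Hoeffding-style bonus, with the environment reward ignored. It is not the authors' implementation; the Gym-like env interface (reset/step returning integer states), the constant c, and the function name ucbzero_explore are illustrative assumptions.

import numpy as np

def ucbzero_explore(env, S, A, H, K, c=1.0, p=0.01):
    """Reward-free exploration sketch: optimistic Q-learning with zero reward.

    Assumes a Gym-like tabular interface: env.reset() -> int state,
    env.step(a) -> (next_state, reward, done, info). The reward is discarded.
    """
    iota = np.log(S * A * H * K / p)            # log factor used in the bonus
    Q = np.full((H, S, A), float(H))            # optimistic initialization
    counts = np.zeros((H, S, A), dtype=int)
    dataset = []                                 # trajectories kept for later tasks
    for _ in range(K):
        s = env.reset()
        traj = []
        for h in range(H):
            a = int(np.argmax(Q[h, s]))          # act greedily on the optimistic Q
            s_next, _, _, _ = env.step(a)        # environment reward is ignored
            traj.append((h, s, a, s_next))
            counts[h, s, a] += 1
            n = counts[h, s, a]
            alpha = (H + 1) / (H + n)            # learning rate of Jin et al. (2018)
            bonus = c * np.sqrt(H ** 3 * iota / n)
            v_next = Q[h + 1, s_next].max() if h + 1 < H else 0.0
            # The update target uses only the exploration bonus (zero reward).
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (bonus + v_next)
            s = s_next
        dataset.append(traj)
    return dataset

After the K exploration episodes, the collected trajectories can be relabeled with each task's sampled rewards and used to compute a near-optimal policy per task, which corresponds to the second phase of the framework described in the abstract.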
Pages: 10