Unsupervised Curricula for Visual Meta-Reinforcement Learning

Cited by: 0
Authors
Jabri, Allan [1 ]
Hsu, Kyle [2 ]
Eysenbach, Benjamin [3 ]
Gupta, Abhishek [1 ]
Levine, Sergey [1 ]
Finn, Chelsea [4 ]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94704 USA
[2] Univ Toronto, Toronto, ON, Canada
[3] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[4] Stanford Univ, Stanford, CA 94305 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019) | 2019 / Vol. 32
Funding
U.S. National Science Foundation;
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
In principle, meta-reinforcement learning algorithms leverage experience across many tasks to learn fast reinforcement learning (RL) strategies that transfer to similar tasks. However, current meta-RL approaches rely on manually-defined distributions of training tasks, and hand-crafting these task distributions can be challenging and time-consuming. Can "useful" pre-training tasks be discovered in an unsupervised manner? We develop an unsupervised algorithm for inducing an adaptive meta-training task distribution, i.e., an automatic curriculum, by modeling unsupervised interaction in a visual environment. The task distribution is scaffolded by a parametric density model of the meta-learner's trajectory distribution. We formulate unsupervised meta-RL as information maximization between a latent task variable and the meta-learner's data distribution, and describe a practical instantiation that alternates between integration of recent experience into the task distribution and meta-learning of the updated tasks. Repeating this procedure leads to iterative reorganization such that the curriculum adapts as the meta-learner's data distribution shifts. In particular, we show how discriminative clustering for visual representation can support trajectory-level task acquisition and exploration in domains with pixel observations, avoiding pitfalls of alternatives. In experiments on vision-based navigation and manipulation domains, we show that the algorithm allows for unsupervised meta-learning that transfers to downstream tasks specified by hand-crafted reward functions and serves as pre-training for more efficient supervised meta-learning of test task distributions.
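The information-maximization formulation mentioned in the abstract can be sketched with a standard variational lower bound; the notation below (latent task variable z, trajectory tau, learned task model q_phi, meta-policy pi_theta) is illustrative and not necessarily the authors' exact formulation.

% A minimal sketch of the objective the abstract alludes to (assumed notation):
%   z            latent task variable
%   \tau         trajectory from the meta-learner's data distribution
%   q_\phi(z|\tau)  learned task model (e.g., a discriminative clustering over trajectories)
\begin{align}
  I(\tau; z) &= \mathcal{H}(z) - \mathcal{H}(z \mid \tau) \\
             &\ge \mathcal{H}(z)
               + \mathbb{E}_{z \sim p(z),\, \tau \sim \pi_\theta(\cdot \mid z)}
                 \big[ \log q_\phi(z \mid \tau) \big].
\end{align}

Read this way, the bound suggests the EM-like alternation the abstract describes: an E-step that fits q_phi(z | tau) to recent trajectories (updating the task distribution), and an M-step that meta-trains pi_theta on tasks whose reward is derived from log q_phi(z | tau).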
Pages: 13