Towards an Understanding of Default Policies in Multitask Policy Optimization

Cited by: 0
Authors
Moskovitz, Ted [1 ]
Arbel, Michael [2 ]
Parker-Holder, Jack [3 ]
Pacchiano, Aldo [4 ]
Affiliations
[1] UCL, Gatsby Unit, London, England
[2] INRIA, Paris, France
[3] Univ Oxford, Oxford, England
[4] Microsoft Res, Redmond, WA USA
Source
INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151 | 2022 / Vol. 151
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms with strong performance across multiple domains. In this family of methods, agents are trained to maximize cumulative reward while penalizing deviation in behavior from some reference, or default policy. In addition to empirical success, there is a strong theoretical foundation for understanding RPO methods applied to single tasks, with connections to natural gradient, trust region, and variational approaches. However, there is limited formal understanding of desirable properties for default policies in the multitask setting, an increasingly important domain as the field shifts towards training more generally capable agents. Here, we take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization. Using these results, we then derive a principled RPO algorithm for multitask learning with strong performance guarantees.
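For context, the regularized policy optimization the abstract describes typically maximizes return under a divergence penalty against the default policy. A minimal sketch of the standard KL-regularized objective, assuming a fixed penalty coefficient \alpha and default policy \pi_0 (the paper's exact formulation may differ), is:

\[
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t, a_t) \;-\; \alpha\,\mathrm{KL}\big(\pi(\cdot \mid s_t)\,\|\,\pi_0(\cdot \mid s_t)\big)\Big)\right].
\]

Larger \alpha pulls the learned policy \pi toward \pi_0, which is why the quality of the default policy directly shapes optimization; this is the link the paper formalizes for the multitask setting.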
Pages: 26