A Task-Agnostic Regularizer for Diverse Subpolicy Discovery in Hierarchical Reinforcement Learning

Cited by: 8

Authors:

Huo, Liangyu [1]

Wang, Zulin [1]

Xu, Mai [1]

Song, Yuhang [2]

Affiliations:

[1] Beihang Univ, Sch Elect & Informat Engn, Beijing 100083, Peoples R China

[2] Univ Oxford, Dept Comp Sci, Oxford OX1 2JD, England

Source:

IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS | 2023 / Vol. 53 / Issue 03

Keywords:

Task analysis; Reinforcement learning; Trajectory; Training; Computational modeling; Standards; Knowledge engineering; Hierarchical reinforcement learning (HRL); regularization; reinforcement learning (RL); subpolicy discovery

DOI:

10.1109/TSMC.2022.3209070

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Subject classification code:

0812

Abstract:

The automatic subpolicy discovery approach in hierarchical reinforcement learning (HRL) has recently achieved promising performance on sparse reward tasks. This accelerates transfer learning and unsupervised intelligent creatures while eliminating the domain-specific knowledge constraint. Most previously developed approaches are demonstrated to suffer from collapsing into the situation where one subpolicy dominates the whole task, since they cannot ensure the diversity of different subpolicies. In contrast, this article proposes a task-agnostic regularizer (TAR) for learning diverse subpolicies in HRL. Specifically, we first formulate the discovery of diverse subpolicies as a trajectory inference problem and then propose a corresponding information-theoretic objective to encourage diversity. Subsequently, considering computability, we instantiate the objective as two simplifications for discrete and continuous action spaces. We extensively evaluate the proposed diversity-driven regularizer on three HRL task domains: 1) meta reinforcement learning; 2) hierarchical policy learning in the option framework; and 3) unsupervised subpolicy discovery. The extensive results obtained show that our TAR approach can improve upon the state-of-the-art performance on all three HRL domains without modifying any existing hyperparameters, indicating the wide applicability and robustness of our approach.

Pages: 1932-1944

Page count: 13
