Exploring and Exploiting Conditioning of Reinforcement Learning Agents

Cited by: 6
Authors
Asadulaev, Arip [1 ]
Kuznetsov, Igor [1 ]
Stein, Gideon [1 ]
Filchenkov, Andrey [1 ]
Affiliations
[1] ITMO Univ, Machine Learning Lab, St Petersburg 197101, Russia
Funding
Russian Science Foundation
Keywords
Reinforcement learning; neural networks; policy optimization; generalization; regularization; conditioning; NEURAL-NETWORKS; GO;
DOI
10.1109/ACCESS.2020.3037276
CLC number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Regularization of the Jacobian's singular values has previously been studied for supervised learning problems, where it allows faster learning for both linear and nonlinear networks. It has also been shown that Jacobian conditioning regularization can help avoid the "mode collapse" problem in Generative Adversarial Networks. In this paper, we try to answer the following question: can information about the conditioning of the policy network's Jacobian help shape a more stable and more general policy for reinforcement learning agents? To answer it, we study the behavior of the Jacobian conditioning during policy optimization. We analyze the agent's conditioning under different policies and different sets of hyperparameters, and we study the correspondence between the conditioning and the ratio of achieved rewards. Based on these observations, we propose a conditioning regularization technique. We apply it to the Trust Region Policy Optimization and Proximal Policy Optimization (PPO) algorithms and compare their performance on 8 continuous control tasks; models with the proposed regularization outperform the other models on most of the tasks. We also show that the regularization improves the agent's generalization by comparing PPO performance on CoinRun environments. Finally, we propose an algorithm that uses the agent's condition number to form a robust policy, which we call Jacobian Policy Optimization (JPO). It directly estimates the condition number of the agent's Jacobian and changes the policy trend accordingly. We compare it with PPO on several continuous control tasks in PyBullet environments, and the proposed algorithm provides more stable and efficient reward growth across a range of agents.
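To make the idea of conditioning regularization concrete, the following is a minimal sketch of how the condition number of a policy network's input-output Jacobian could be estimated and added as a penalty to a surrogate policy loss. It assumes PyTorch; the network, the penalty coefficient `cond_weight`, and the stand-in loss are illustrative assumptions, not the authors' exact formulation from the paper.

```python
# Hypothetical sketch: penalize the condition number of the policy Jacobian.
import torch
import torch.nn as nn

# Toy actor network (assumed architecture, not the paper's).
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))

def conditioning_penalty(net: nn.Module, states: torch.Tensor) -> torch.Tensor:
    """Mean condition number of the input-output Jacobian over a batch of states."""
    penalties = []
    for s in states:
        x = s.unsqueeze(0)
        # Jacobian of network outputs w.r.t. the input state, kept differentiable.
        jac = torch.autograd.functional.jacobian(net, x, create_graph=True)
        jac = jac.reshape(net(x).numel(), x.numel())
        sv = torch.linalg.svdvals(jac)
        penalties.append(sv.max() / (sv.min() + 1e-8))  # condition number kappa(J)
    return torch.stack(penalties).mean()

# Usage: add the penalty to a surrogate policy loss (e.g. the PPO clipped objective).
states = torch.randn(4, 8)
cond_weight = 1e-3  # hypothetical coefficient
surrogate_loss = -policy(states).mean()  # placeholder for the actual policy objective
loss = surrogate_loss + cond_weight * conditioning_penalty(policy, states)
loss.backward()
```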
Pages: 211951-211960
Number of pages: 10