Improve generated adversarial imitation learning with reward variance regularization

Cited by: 8
Authors
Zhang, Yi-Feng [1 ]
Luo, Fan-Ming [1 ]
Yu, Yang [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
Keywords
Imitation learning; Reinforcement learning; Generative adversarial model; Discriminator reward;
DOI
10.1007/s10994-021-06083-7
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Imitation learning aims at recovering expert policies from limited demonstration data. Generative Adversarial Imitation Learning (GAIL) employs the generative adversarial learning framework for imitation learning and has shown great potential. GAIL and its variants, however, have been found to be highly sensitive to hyperparameters and hard to converge in practice. One key issue is that the supervised-learning discriminator learns much faster than the reinforcement-learning generator, causing the generator's gradient to vanish. Although GAIL is formulated as a zero-sum adversarial game, its ultimate goal is to learn the generator, so the discriminator should act more like a teacher than a true opponent. The learning of the discriminator should therefore take into account how the generator can learn. In this paper, we show that enhancing the gradient of the generator's training is equivalent to increasing the variance of the fake reward provided by the discriminator output. We thus propose an improved version of GAIL, GAIL-VR, in which the discriminator also learns to avoid generator gradient vanishing through regularization of the fake-reward variance. Experiments on various tasks, including locomotion tasks and Atari games, indicate that GAIL-VR improves training stability and imitation scores.
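The abstract's core idea, regularizing the variance of the discriminator-derived fake reward so that the generator's gradient does not vanish, can be illustrated with a minimal sketch. The sketch below assumes a PyTorch discriminator that outputs logits on (state, action) pairs and uses the common -log(1 - D(s, a)) fake reward; the function name, the reward form, and the regularization weight var_coef are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def gail_vr_discriminator_loss(disc, expert_sa, policy_sa, var_coef=0.1):
    """Illustrative GAIL discriminator loss with a fake-reward variance
    regularizer (the reward form, sign, and var_coef weight are assumptions,
    not the paper's exact formulation).

    disc      : network mapping a batch of (state, action) features to logits
    expert_sa : tensor of expert state-action pairs
    policy_sa : tensor of generator (policy) state-action pairs
    """
    expert_logits = disc(expert_sa)
    policy_logits = disc(policy_sa)

    # Standard GAIL binary-classification loss:
    # expert pairs labeled 1, generator pairs labeled 0.
    adv_loss = (
        F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
        + F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits))
    )

    # Fake reward handed to the generator, here -log(1 - D(s, a))
    # evaluated on generator samples.
    fake_reward = -torch.log(1.0 - torch.sigmoid(policy_logits) + 1e-8)

    # Variance regularizer: subtracting it encourages the discriminator to
    # keep the fake rewards spread out, so the generator's policy-gradient
    # signal does not collapse toward zero.
    var_reg = fake_reward.var()

    return adv_loss - var_coef * var_reg
```

In this sketch the regularizer is subtracted from the adversarial loss, so minimizing the total loss pushes the discriminator toward higher fake-reward variance; the appropriate weight var_coef would be a tuning choice.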
Pages: 977-995
Page count: 19