DNA: Proximal Policy Optimization with a Dual Network Architecture

Cited by: 0
Authors
Aitchison, Matthew [1]
Sweetser, Penny [1]
Affiliations
[1] Australian Natl Univ, Canberra, ACT, Australia
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022
Keywords
ARCADE LEARNING-ENVIRONMENT; REINFORCEMENT; SHOGI; CHESS; GO;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper explores the problem of simultaneously learning a value function and policy in deep actor-critic reinforcement learning models. We find that the common practice of learning these functions jointly is sub-optimal due to an order-of-magnitude difference in noise levels between the two tasks. Instead, we show that learning these tasks independently, but with a constrained distillation phase, significantly improves performance. Furthermore, we find that policy gradient noise levels decrease when using a lower-variance return estimate, whereas value learning noise levels decrease with a lower-bias estimate. Together, these insights inform an extension to Proximal Policy Optimization we call Dual Network Architecture (DNA), which significantly outperforms its predecessor. DNA also exceeds the performance of the popular Rainbow DQN algorithm on four of the five environments tested, even under more difficult stochastic control settings.
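The abstract does not say how the two return estimates are produced. As a minimal sketch, assuming they come from Generalized Advantage Estimation (GAE) as commonly used with PPO, a smaller lambda gives a lower-variance (but more biased) estimate and a larger lambda gives a lower-bias (but noisier) one. The function name `gae_advantages`, the sample trajectory, and the lambda values 0.8 and 0.95 below are illustrative assumptions, not settings taken from this record.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single trajectory.

    `values` must hold one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical trajectory (assumed data, for illustration only).
rewards = [1.0, 0.0, 1.0]
values = [0.5, 0.4, 0.6, 0.0]

# Lower lambda for the policy update: lower variance, more bias.
policy_adv = gae_advantages(rewards, values, lam=0.8)

# Higher lambda for the value update: lower bias, more variance.
value_adv = gae_advantages(rewards, values, lam=0.95)
value_targets = [a + v for a, v in zip(value_adv, values[:-1])]
```

With `lam=0` the estimate reduces to the one-step TD error, and with `gamma=1, lam=1` it reduces to the Monte Carlo return minus the value baseline, which are the two ends of the bias/variance trade-off the abstract refers to.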
Pages: 12