The Value Equivalence Principle for Model-Based Reinforcement Learning

Cited by: 0
Authors
Grimm, Christopher [1 ]
Barreto, Andre [2 ]
Singh, Satinder [2 ]
Silver, David [2 ]
Affiliations
[1] Univ Michigan, Comp Sci & Engn, Ann Arbor, MI 48109 USA
[2] DeepMind, London, England
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020 | 2020 / Vol. 33
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Learning models of the environment from data is often viewed as an essential component to building intelligent reinforcement learning (RL) agents. The common practice is to separate the learning of the model from its use, by constructing a model of the environment's dynamics that correctly predicts the observed state transitions. In this paper we argue that the limited representational resources of model-based RL agents are better used to build models that are directly useful for value-based planning. As our main contribution, we introduce the principle of value equivalence: two models are value equivalent with respect to a set of functions and policies if they yield the same Bellman updates. We propose a formulation of the model learning problem based on the value equivalence principle and analyze how the set of feasible solutions is impacted by the choice of policies and functions. Specifically, we show that, as we augment the set of policies and functions considered, the class of value equivalent models shrinks, until eventually collapsing to a single point corresponding to a model that perfectly describes the environment. In many problems, directly modelling state-to-state transitions may be both difficult and unnecessary. By leveraging the value equivalence principle one may find simpler models without compromising performance, saving computation and memory. We illustrate the benefits of value-equivalent model learning with experiments comparing it against more traditional counterparts like maximum likelihood estimation. More generally, we argue that the principle of value equivalence underlies a number of recent empirical successes in RL, such as Value Iteration Networks, the Predictron, Value Prediction Networks, TreeQN, and MuZero, and provides a first theoretical underpinning of those results.
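As a brief illustrative sketch of the definition stated in the abstract (the notation below, including $r_m$, $p_m$, $\gamma$, $\Pi$ and $\mathcal{V}$, is a standard Bellman-operator convention assumed here for clarity and is not part of the indexed record): writing the Bellman operator induced by a model $m$ under a policy $\pi$ as
\[
(\mathcal{T}^{m}_{\pi} v)(s) \;=\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[\, r_m(s,a) \;+\; \gamma\, \mathbb{E}_{s' \sim p_m(\cdot \mid s,a)}\big[ v(s') \big] \Big],
\]
two models $m$ and $\tilde{m}$ are value equivalent with respect to a set of policies $\Pi$ and a set of functions $\mathcal{V}$ if
\[
\mathcal{T}^{m}_{\pi} v \;=\; \mathcal{T}^{\tilde{m}}_{\pi} v \qquad \text{for all } \pi \in \Pi \text{ and } v \in \mathcal{V}.
\]
Under this reading, enlarging $\Pi$ and $\mathcal{V}$ imposes more Bellman-update constraints, which is why the abstract notes that the class of value equivalent models shrinks until only a model matching the true environment remains.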
Pages: 12