Model gradient: unified model and policy learning in model-based reinforcement learning

Cited by: 3
Authors
Jia, Chengxing [1 ,2 ]
Zhang, Fuxiang [1 ,2 ]
Xu, Tian [1 ,2 ]
Pang, Jing-Cheng [1 ,2 ]
Zhang, Zongzhang [1 ]
Yu, Yang [1 ,2 ]
Affiliations
[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Peoples R China
[2] Polixir Technol, Nanjing 210000, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
reinforcement learning; model-based reinforcement learning; Markov decision process; GO;
DOI
10.1007/s11704-023-3150-5
CLC Classification Code
TP [Automation technology, computer technology];
Discipline Classification Code
0812;
Abstract
Model-based reinforcement learning is a promising direction for improving the sample efficiency of reinforcement learning by learning a model of the environment. Previous model learning methods aim to fit the transition data and commonly employ a supervised learning approach that minimizes the distance between the predicted state and the real state. Such supervised model learning, however, diverges from the ultimate goal of model learning, i.e., optimizing the policy that is learned in the model. In this work, we investigate how model learning and policy learning can share the same objective of maximizing the expected return in the real environment. We find that model learning toward this objective yields a target of enhancing the similarity between the gradient computed on model-generated data and the gradient computed on real data. We therefore derive the gradient of the model from this target and propose the Model Gradient algorithm (MG), which integrates this model learning approach with policy-gradient-based policy optimization. Experiments on multiple locomotion control tasks show that MG not only achieves high sample efficiency but also attains better convergence performance than traditional model-based reinforcement learning approaches.
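A schematic rendering of the gradient-matching idea described in the abstract (an illustrative sketch only, not the paper's exact objective; the divergence measure D, the advantage estimator \hat{A}, and the notation M / \hat{M} for the real environment and the learned model are assumptions for exposition):

    \min_{\hat{M}} \; D\Big( \mathbb{E}_{(s,a)\sim \hat{M},\,\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a\mid s)\,\hat{A}(s,a)\big],\; \mathbb{E}_{(s,a)\sim M,\,\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a\mid s)\,\hat{A}(s,a)\big] \Big)

Read this way, the model \hat{M} is trained so that the policy gradient estimated from model-generated data stays close to the gradient estimated from real data, and the policy \pi_\theta is then updated with a standard policy-gradient step on model-generated rollouts.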
Pages: 12