Improving Actor-Critic Reinforcement Learning via Hamiltonian Monte Carlo Method

Cited by: 1
Authors
Xu D. [1 ]
Fekri F. [1 ]
Affiliations
[1] Georgia Institute of Technology, School of Electrical and Computer Engineering, Atlanta, GA 30332, USA
Source
IEEE Transactions on Artificial Intelligence | 2023, Vol. 4, No. 6
Keywords
Hamiltonian Monte Carlo (HMC); learning to control; reinforcement learning (RL); soft actor-critic (SAC);
DOI
10.1109/TAI.2022.3215614
Abstract
Actor-critic reinforcement learning (RL) is widely used in robotic control tasks. Viewed from the perspective of variational inference (VI), the policy network of actor-critic RL is trained to approximate the posterior over actions given the optimality criteria. In practice, however, actor-critic RL may yield suboptimal policy estimates due to the amortization gap and insufficient exploration. In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate the policy network of actor-critic RL with HMC, which we term the Hamiltonian policy. Specifically, we evolve actions drawn from the base policy according to HMC, which offers several benefits. First, HMC can improve the policy distribution so that it better approximates the posterior, reducing the amortization gap. Second, HMC can guide exploration toward regions of the action space with higher Q values, improving exploration efficiency. Further, instead of applying HMC to RL directly, we propose a new leapfrog operator to simulate the Hamiltonian dynamics. Finally, in safe RL problems, we find that the proposed method not only improves the achieved return but also reduces safety constraint violations by discarding potentially unsafe actions. Through comprehensive empirical experiments on continuous-control benchmarks, including MuJoCo and PyBullet Roboschool, we show that the proposed approach is a data-efficient and easy-to-implement improvement over previous actor-critic methods. © 2020 IEEE.
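To make the idea of a Hamiltonian policy concrete, below is a minimal sketch (not the paper's exact algorithm) of refining an action sampled from a base policy with standard HMC leapfrog steps guided by the critic, treating U(a) = -Q(s, a)/alpha as the potential energy as in maximum-entropy RL. The names `q_net`, `alpha`, `step_size`, and `n_leapfrog` are illustrative assumptions, and the standard leapfrog integrator used here differs from the modified leapfrog operator proposed in the paper.

```python
# Sketch: HMC refinement of a base-policy action toward higher-Q regions.
# Assumes q_net(state, action) is a differentiable critic (e.g., an SAC critic)
# and alpha is the entropy temperature; both are illustrative placeholders.
import torch


def hmc_refine_action(q_net, state, action, alpha=0.2,
                      step_size=0.05, n_leapfrog=3):
    """One HMC transition targeting p(a | s) proportional to exp(Q(s, a) / alpha)."""

    def potential(a):
        # Negative log of the (unnormalized) action posterior.
        return -q_net(state, a) / alpha

    def grad_potential(a):
        a = a.detach().requires_grad_(True)
        return torch.autograd.grad(potential(a).sum(), a)[0]

    a0 = action.detach()
    p0 = torch.randn_like(a0)          # auxiliary momentum
    a, p = a0.clone(), p0.clone()

    # Standard leapfrog integration of the Hamiltonian dynamics.
    p = p - 0.5 * step_size * grad_potential(a)
    for i in range(n_leapfrog):
        a = a + step_size * p
        if i < n_leapfrog - 1:
            p = p - step_size * grad_potential(a)
    p = p - 0.5 * step_size * grad_potential(a)

    # Metropolis correction: keep the refined action only with the proper
    # acceptance probability; otherwise fall back to the base-policy action.
    h0 = potential(a0) + 0.5 * (p0 ** 2).sum(-1, keepdim=True)
    h1 = potential(a) + 0.5 * (p ** 2).sum(-1, keepdim=True)
    accept = (torch.rand_like(h0) < torch.exp(h0 - h1)).float()
    return accept * a + (1.0 - accept) * a0
```

In an SAC-style agent, one could call `hmc_refine_action(critic, state, policy_sample)` before executing the action in the environment; proposals rejected by the Metropolis step revert to the base-policy sample, which is one way low-Q or potentially unsafe actions can be discarded.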
Pages: 1642-1653 (11 pages)