Generalized Maximum Entropy Reinforcement Learning via Reward Shaping

Cited by: 2
Authors
Tao F. [1 ]
Wu M. [2 ]
Cao Y. [2 ]
Affiliations
[1] Volvo Car Technology USA LLC, Sunnyvale, CA 94085
[2] University of Texas, Department of Electrical Engineering, San Antonio, TX 78249
Source
IEEE Transactions on Artificial Intelligence | 2024 / Vol. 5 / Issue 4
Keywords
Entropy; reinforcement learning (RL); reward shaping
DOI
10.1109/TAI.2023.3297988
Abstract
Entropy regularization is a commonly used technique in reinforcement learning to improve exploration and cultivate a better pretrained policy for later adaptation. Recent studies further show that entropy regularization can smooth the optimization landscape and simplify the policy optimization process, indicating the value of integrating entropy into reinforcement learning. However, existing studies only consider the policy's entropy at the current state as an extra regularization term in the policy gradient or in the objective function, without formally integrating the entropy into the reward function. In this article, we propose a shaped reward that incorporates the agent's policy entropy into the reward function. In particular, the agent's expected entropy over the distribution of next states is added to the immediate reward associated with the current state. The addition of the agent's expected policy entropy at the next-state distribution is shown to yield a new soft Q-function and state value function that are concise and modular. Moreover, the new reinforcement learning framework can be easily applied to existing standard reinforcement learning algorithms, such as deep Q-network (DQN) and proximal policy optimization (PPO), while inheriting the benefits of employing entropy regularization. We further present a soft stochastic policy gradient theorem based on the shaped reward and propose a new practical reinforcement learning algorithm. Finally, experimental studies are conducted in MuJoCo environments to demonstrate that our method can outperform an existing state-of-the-art off-policy maximum entropy reinforcement learning approach, soft actor-critic (SAC), by 5%-150% in terms of average return. © 2020 IEEE.
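The core idea in the abstract — adding the expected policy entropy over the next-state distribution to the immediate reward — can be sketched as follows. This is a minimal illustration in a discrete tabular setting, not the paper's implementation; the function names, the example transition model, and the temperature coefficient `alpha` are all hypothetical.

```python
import numpy as np

def policy_entropy(pi_s):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution."""
    pi_s = np.asarray(pi_s, dtype=float)
    return float(-np.sum(pi_s * np.log(pi_s + 1e-12)))

def shaped_reward(r, next_state_probs, policy, alpha=0.1):
    """Shaped reward r'(s,a) = r(s,a) + alpha * E_{s'}[H(pi(.|s'))],
    where the expectation is over the next-state distribution P(s'|s,a)."""
    expected_entropy = sum(p * policy_entropy(policy[s_next])
                           for s_next, p in next_state_probs.items())
    return r + alpha * expected_entropy

# Hypothetical 2-state, 2-action example: a uniform policy in state 0
# has entropy ln(2), a deterministic policy in state 1 has entropy ~0.
policy = {0: [0.5, 0.5], 1: [1.0, 0.0]}
transition = {0: 0.7, 1: 0.3}       # P(s'|s,a) for the chosen (s, a)
r_shaped = shaped_reward(1.0, transition, policy, alpha=0.1)
# r_shaped = 1.0 + 0.1 * (0.7 * ln 2 + 0.3 * 0) ~= 1.0485
```

Because the entropy bonus is folded into the reward itself rather than appended to the objective, it passes through any standard return-based algorithm (e.g., DQN or PPO) unchanged, which is the modularity the abstract highlights.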
Pages: 1563-1572
Page count: 9