Generalized Maximum Entropy Reinforcement Learning via Reward Shaping

Cited by: 2
Authors
Tao F. [1 ]
Wu M. [2 ]
Cao Y. [2 ]
Affiliations
[1] Volvo Car Technology USA LLC, Sunnyvale, CA 94085
[2] University of Texas at San Antonio, Department of Electrical Engineering, San Antonio, TX 78249
Source
IEEE Transactions on Artificial Intelligence | 2024 / Vol. 5 / No. 4
Keywords
Entropy; reinforcement learning (RL); reward shaping
DOI
10.1109/TAI.2023.3297988
Abstract
Entropy regularization is a commonly used technique in reinforcement learning to improve exploration and cultivate a better pre-trained policy for later adaptation. Recent studies further show that entropy regularization can smooth the optimization landscape and simplify policy optimization, indicating the value of integrating entropy into reinforcement learning. However, existing studies only consider the policy's entropy at the current state as an extra regularization term in the policy gradient or in the objective function, without formally integrating the entropy into the reward function. In this article, we propose a shaped reward that incorporates the agent's policy entropy into the reward function. In particular, the agent's expected entropy over the distribution of the next state is added to the immediate reward associated with the current state. Adding the agent's expected policy entropy at the next-state distribution is shown to yield a new soft Q-function and state value function that are concise and modular. Moreover, the new reinforcement learning framework can be easily applied to existing standard reinforcement learning algorithms, such as deep Q-network (DQN) and proximal policy optimization (PPO), while inheriting the benefits of entropy regularization. We further present a soft stochastic policy gradient theorem based on the shaped reward and propose a new practical reinforcement learning algorithm. Finally, experimental studies in MuJoCo environments demonstrate that our method can outperform the state-of-the-art off-policy maximum entropy reinforcement learning approach, soft actor-critic (SAC), by 5%-150% in terms of average return. © 2020 IEEE.
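As a rough sketch of the shaping described above (reconstructed from this abstract alone; the temperature coefficient \alpha and the notation below are assumptions, not the paper's own definitions), the shaped reward adds the expected next-state policy entropy to the immediate reward:

\tilde{r}(s_t, a_t) = r(s_t, a_t) + \alpha \, \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)} \big[ \mathcal{H}\big( \pi(\cdot \mid s_{t+1}) \big) \big], \quad \text{where } \mathcal{H}\big(\pi(\cdot \mid s)\big) = -\mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ \log \pi(a \mid s) \big].

Under this sketch, a standard algorithm such as DQN or PPO can simply be run on \tilde{r} in place of r, which is how the framework can inherit the benefits of entropy regularization without modifying the underlying algorithm.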
Pages: 1563-1572
Number of pages: 9
Related papers
50 items in total
  • [31] The generalized maximum belief entropy model
    Li, Siran
    Cai, Rui
    SOFT COMPUTING, 2022, 26 (09) : 4187 - 4198
  • [32] Generalized Maximum Entropy for Supervised Classification
    Mazuelas, Santiago
    Shen, Yuan
    Perez, Aritz
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2022, 68 (04) : 2530 - 2550
  • [34] Potential-based reward shaping using state-space segmentation for efficiency in reinforcement learning
    Bal, Melis Ilayda
    Aydin, Huseyin
    Iyigun, Cem
    Polat, Faruk
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2024, 157 : 469 - 484
  • [35] A Multi-Dimensional Goal Aircraft Guidance Approach Based on Reinforcement Learning with a Reward Shaping Algorithm
    Zu, Wenqiang
    Yang, Hongyu
    Liu, Renyu
    Ji, Yulong
    SENSORS, 2021, 21 (16)
  • [36] Population-based exploration in reinforcement learning through repulsive reward shaping using eligibility traces
    Bal, Melis Ilayda
    Iyigun, Cem
    Polat, Faruk
    Aydin, Huseyin
    ANNALS OF OPERATIONS RESEARCH, 2024, 335 (02) : 689 - 725
  • [37] Multi-Agent Meta-Reinforcement Learning with Coordination and Reward Shaping for Traffic Signal Control
    Du, Xin
    Wang, Jiahai
    Chen, Siyuan
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2023, PT II, 2023, 13936 : 349 - 360
  • [38] Car-Following Behavior Modeling With Maximum Entropy Deep Inverse Reinforcement Learning
    Nan, Jiangfeng
    Deng, Weiwen
    Zhang, Ruzheng
    Zhao, Rui
    Wang, Ying
    Ding, Juan
    IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, 2024, 9 (02) : 3998 - 4010
  • [39] Reward shaping via expectation maximization method
    Deng, Zelin
    Liu, Xing
    Dong, Yunlong
    NEUROCOMPUTING, 2024, 609
  • [40] Bottom-up multi-agent reinforcement learning by reward shaping for cooperative-competitive tasks
    Aotani, Takumi
    Kobayashi, Taisuke
    Sugimoto, Kenji
    APPLIED INTELLIGENCE, 2021, 51 (07) : 4434 - 4452