Learning Potential in Subgoal-Based Reward Shaping

Cited: 0
Authors
Okudo, Takato [1]
Yamada, Seiji
Affiliation
[1] Grad Univ Adv Studies SOKENDAI, Dept Informat, Tokyo 1018430, Japan
Funding
Japan Science and Technology Agency (JST);
Keywords
Trajectory; Reinforcement learning; Human factors; Planning; Deep learning; Optimization; Machine learning algorithms; deep reinforcement learning; subgoals; reward shaping; potential-based reward shaping; subgoal-based reward shaping;
DOI
10.1109/ACCESS.2023.3246267
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Human knowledge can reduce the number of interactions a reinforcement-learning agent needs to learn a task. The most common way to provide such knowledge is through demonstration trajectories, but these are difficult to obtain in some domains, so subgoals, i.e., intermediate states on the way to the goal, have been studied as an alternative. Subgoal-based reward shaping adds shaping rewards, derived from a sequence of subgoals, to the environmental rewards. Its potential function is controlled by a hyperparameter that scales its output, but selecting this hyperparameter is difficult because its appropriate value depends on the environment's reward function, which is unknown even though its outputs can be observed. We propose a learned potential, which parameterizes this hyperparameter and acquires the potential function through learning. A value is the expected accumulated reward obtained by following the policy from the current state onward, so it is closely related to the reward function. With the learned potential, we build an abstract state space, a higher-level representation of the state defined by the sequence of subgoals, and use the value over the abstract states as the potential to accelerate value learning; an n-step temporal-difference (TD) method learns these values over the abstract states. We conducted experiments to evaluate the learned potential, and the results indicate that it is effective compared with a baseline reinforcement-learning algorithm and several reward-shaping algorithms. The results also indicate that participant-provided subgoals outperform randomly generated subgoals when used with the learned potential. Finally, we discuss the appropriate number of subgoals for the learned potential, show that partially ordered subgoals are helpful, and find that the learned potential does not improve learning efficiency under step-penalty rewards but outperforms the non-learned potential under mixed positive and negative rewards.
Pages: 17116-17137
Number of pages: 22
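
The abstract outlines potential-based reward shaping in which the potential is a value function over an abstract state space built from a sequence of subgoals and learned with an n-step TD method. The Python sketch below illustrates that idea under stated assumptions only; it is not the authors' implementation, and the class name LearnedPotential, the mapping of a state to the index of its last reached subgoal, and the default hyperparameters are hypothetical.

# Illustrative sketch only: the abstract-state mapping, names, and update
# details are assumptions made for exposition, not the paper's code.
import numpy as np


class LearnedPotential:
    """Shaping potential Phi defined as a value function over abstract states
    (here: the index of the last subgoal reached), learned with n-step TD."""

    def __init__(self, num_subgoals, gamma=0.99, alpha=0.1, n_step=4):
        # Abstract state z in {0, ..., num_subgoals}.
        self.values = np.zeros(num_subgoals + 1)
        self.gamma, self.alpha, self.n = gamma, alpha, n_step
        self.buffer = []  # recent (abstract_state, reward) transitions

    def shaping_reward(self, z, z_next):
        # Potential-based shaping term F(z, z') = gamma * Phi(z') - Phi(z),
        # the form that preserves the optimal policy.
        return self.gamma * self.values[z_next] - self.values[z]

    def observe(self, z, reward, z_next, done=False):
        # Buffer the transition; once n transitions are stored, update the
        # oldest abstract state toward its n-step TD return.
        self.buffer.append((z, reward))
        if done:
            while self.buffer:  # flush with no bootstrap at episode end
                self._update(bootstrap=0.0)
        elif len(self.buffer) >= self.n:
            self._update(bootstrap=self.values[z_next])

    def _update(self, bootstrap):
        z0, _ = self.buffer[0]
        ret = sum(self.gamma ** i * r for i, (_, r) in enumerate(self.buffer))
        ret += self.gamma ** len(self.buffer) * bootstrap
        self.values[z0] += self.alpha * (ret - self.values[z0])
        self.buffer.pop(0)


# Hypothetical usage inside any RL training loop (DQN, PPO, ...):
#   z      = subgoals_reached(state)       # assumed mapping to the abstract state
#   z_next = subgoals_reached(next_state)
#   shaped = env_reward + potential.shaping_reward(z, z_next)
#   potential.observe(z, env_reward, z_next, done)
# The base agent is then trained on `shaped` instead of env_reward; because the
# shaping term is potential-based, the optimal policy of the original task is preserved.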