High-Value Prioritized Experience Replay for Off-policy Reinforcement Learning

Cited by: 25
Authors
Cao, Xi [1 ,2 ]
Wan, Huaiyu [1 ,2 ]
Lin, Youfang [1 ,2 ]
Han, Sheng [1 ,2 ]
Affiliations
[1] Beijing Jiaotong Univ, Sch Comp & Informat Technol, Beijing Key Lab Traff Data Anal & Min, Beijing, Peoples R China
[2] CAAC, Key Lab Intelligent Passenger Serv Civil Aviat, Beijing, Peoples R China
Source
2019 IEEE 31ST INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2019) | 2019
Funding
National Natural Science Foundation of China;
Keywords
deep reinforcement learning; experience replay; high-value; temporal-difference error;
DOI
10.1109/ICTAI.2019.00215
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
In deep reinforcement learning, experience replay has been shown to be an effective way to address sample inefficiency. Prioritized Experience Replay (PER) uses the temporal-difference error (TD error) as the replay priority in Deep Q-Networks (DQN), so that the agent can learn more effectively from important experiences. However, experiences with large TD error may lie near the edge of the state space, and such experiences do not help the agent learn a policy quickly. We present a novel technique called High-Value Prioritized Experience Replay (HVPER), which combines TD error and value (reward or state-action value) in the replay priority. Specifically, we first propose prioritizing replay based on reward and TD error in sparse-reward environments. We then extend this design by prioritizing replay based on state-action value and TD error for more general environments. We design experiments in the gym environment to evaluate the proposed HVPER. First, we verify that the combination of TD error and reward improves training speed on two sparse-reward problems compared to the DQN and PER algorithms. In addition, HVPER accelerates network learning and achieves better performance on two continuous-space problems compared to the Deep Deterministic Policy Gradient algorithm.
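To make the priority idea in the abstract concrete, below is a minimal sketch of a replay priority that mixes |TD error| with a value signal (here, reward) before PER-style proportional sampling. The mixing coefficient eta, the normalization, and all function names are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
# Sketch: HVPER-style priority mixing |TD error| with reward (assumed form).
import numpy as np

def hvper_priority(td_errors, rewards, eta=0.5, eps=1e-6):
    """Combine |TD error| and reward into a single replay priority."""
    td_term = np.abs(td_errors)
    # Shift rewards to be non-negative so sparse negative rewards do not
    # zero out a transition's priority (assumption of this sketch).
    r_term = rewards - rewards.min()
    # Normalize both terms to [0, 1] before mixing.
    td_term = td_term / (td_term.max() + eps)
    r_term = r_term / (r_term.max() + eps)
    return eta * td_term + (1.0 - eta) * r_term + eps

def sample_indices(priorities, batch_size, alpha=0.6, seed=0):
    """Proportional sampling as in PER: P(i) is proportional to priority_i^alpha."""
    rng = np.random.default_rng(seed)
    probs = priorities ** alpha
    probs /= probs.sum()
    return rng.choice(len(priorities), size=batch_size, p=probs)

# Usage: rank a small buffer of transitions and draw a minibatch.
td = np.array([0.1, 2.0, 0.05, 0.3])
rw = np.array([0.0, 0.0, 1.0, 0.0])
p = hvper_priority(td, rw)
print(sample_indices(p, batch_size=2))
```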
Pages: 1510-1514
Number of pages: 5