High-Value Prioritized Experience Replay for Off-policy Reinforcement Learning

Cited by: 25
Authors
Cao, Xi [1 ,2 ]
Wan, Huaiyu [1 ,2 ]
Lin, Youfang [1 ,2 ]
Han, Sheng [1 ,2 ]
Affiliations
[1] Beijing Jiaotong Univ, Sch Comp & Informat Technol, Beijing Key Lab Traff Data Anal & Min, Beijing, Peoples R China
[2] CAAC, Key Lab Intelligent Passenger Serv Civil Aviat, Beijing, Peoples R China
Source
2019 IEEE 31ST INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2019) | 2019
Funding
National Natural Science Foundation of China;
Keywords
deep reinforcement learning; experience replay; high-value; temporal-difference error;
DOI
10.1109/ICTAI.2019.00215
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
In deep reinforcement learning, experience replay has been shown to be an effective way to address sample inefficiency. Prioritized Experience Replay (PER) uses the temporal-difference error (TD error) as the replay priority in Deep Q-Networks (DQN), so that the agent can learn more effectively from important experiences. However, experiences with large TD error may lie near the edge of the state space, and such experiences do not help the agent learn a policy quickly. We present a novel technique called High-Value Prioritized Experience Replay (HVPER), which combines TD error and value (reward or state-action value) in the replay priority. Specifically, we first propose prioritizing replay based on reward and TD error in sparse-reward environments. We then extend this design by prioritizing replay based on state-action value and TD error for more general environments. We design experiments in the gym environment to evaluate the proposed HVPER. First, we verify that the combination of TD error and reward improves training speed on two sparse-reward problems compared to the DQN and PER algorithms. In addition, HVPER accelerates network learning and achieves better performance on two continuous-space problems compared to the Deep Deterministic Policy Gradient algorithm.
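To make the priority idea in the abstract concrete, below is a minimal sketch of a replay priority that mixes |TD error| with a value signal (here, reward) before PER-style proportional sampling. The mixing coefficient eta, the normalization, and all function names are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
# Sketch: HVPER-style priority mixing |TD error| with reward (assumed form).
import numpy as np

def hvper_priority(td_errors, rewards, eta=0.5, eps=1e-6):
    """Combine |TD error| and reward into a single replay priority."""
    td_term = np.abs(td_errors)
    # Shift rewards to be non-negative so sparse negative rewards do not
    # zero out a transition's priority (assumption of this sketch).
    r_term = rewards - rewards.min()
    # Normalize both terms to [0, 1] before mixing.
    td_term = td_term / (td_term.max() + eps)
    r_term = r_term / (r_term.max() + eps)
    return eta * td_term + (1.0 - eta) * r_term + eps

def sample_indices(priorities, batch_size, alpha=0.6, seed=0):
    """Proportional sampling as in PER: P(i) is proportional to priority_i^alpha."""
    rng = np.random.default_rng(seed)
    probs = priorities ** alpha
    probs /= probs.sum()
    return rng.choice(len(priorities), size=batch_size, p=probs)

# Usage: rank a small buffer of transitions and draw a minibatch.
td = np.array([0.1, 2.0, 0.05, 0.3])
rw = np.array([0.0, 0.0, 1.0, 0.0])
p = hvper_priority(td, rw)
print(sample_indices(p, batch_size=2))
```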
Pages: 1510-1514
Number of pages: 5