Off-Policy Reinforcement Learning with Delayed Rewards

Cited by: 0

Authors:

Han, Beining ^{[1]}

Ren, Zhizhou ^{[2,3]}

Wu, Zuofan ^{[3]}

Zhou, Yuan ^{[4]}

Peng, Jian ^{[2,3,5]}

Affiliations:

[1] Tsinghua Univ, Inst Interdisciplinary Informat Sci, Beijing, Peoples R China

[2] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA

[3] Helixon Ltd, Beijing, Peoples R China

[4] Tsinghua Univ, Yau Math Sci Ctr, Beijing, Peoples R China

[5] Tsinghua Univ, Inst Ind AI Res, Beijing, Peoples R China

Source:

INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162 | 2022

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We study deep reinforcement learning (RL) algorithms with delayed rewards. In many real-world tasks, instant rewards are often not readily accessible or even defined immediately after the agent performs actions. In this work, we first formally define the environment with delayed rewards and discuss the challenges raised due to the non-Markovian nature of such environments. Then, we introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees. For practical tasks with high dimensional state spaces, we further introduce the HC-decomposition rule of the Q-function in our framework which naturally leads to an approximation scheme that helps boost the training efficiency and stability. We finally conduct extensive experiments to demonstrate the superior performance of our algorithms over the existing work and their variants.

Pages: 24

Related literature (50 items in total):

[1] Safe and efficient off-policy reinforcement learning
Munos, Remi
Stepleton, Thomas
Harutyunyan, Anna
Bellemare, Marc G.
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), 2016, 29
[2] Bounds for Off-policy Prediction in Reinforcement Learning
Joseph, Ajin George
Bhatnagar, Shalabh
2017 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2017, : 3991 - 3997
[3] Off-Policy Reinforcement Learning with Gaussian Processes
Chowdhary, Girish
Liu, Miao
Grande, Robert
Walsh, Thomas
How, Jonathan
Carin, Lawrence
IEEE/CAA Journal of Automatica Sinica, 2014, 1 (03) : 227 - 238
[4] Representations for Stable Off-Policy Reinforcement Learning
Ghosh, Dibya
Bellemare, Marc G.
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 119, 2020, 119
[5] A perspective on off-policy evaluation in reinforcement learning
Li, Lihong
FRONTIERS OF COMPUTER SCIENCE, 2019, 13 (05) : 911 - 912
[6] On the Reuse Bias in Off-Policy Reinforcement Learning
Ying, Chengyang
Hao, Zhongkai
Zhou, Xinning
Su, Hang
Yan, Dong
Zhu, Jun
PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 4513 - 4521
[7] A perspective on off-policy evaluation in reinforcement learning
Li, Lihong
Frontiers of Computer Science, 2019, 13 : 911 - 912
[8] Reliable Off-Policy Evaluation for Reinforcement Learning
Wang, Jie
Gao, Rui
Zha, Hongyuan
OPERATIONS RESEARCH, 2024, 72 (02) : 699 - 716
[9] Sequential Search with Off-Policy Reinforcement Learning
Miao, Dadong
Wang, Yanan
Tang, Guoyu
Liu, Lin
Xu, Sulong
Long, Bo
Xiao, Yun
Wu, Lingfei
Jiang, Yunjiang
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 4006 - 4015
[10] Representations for Stable Off-Policy Reinforcement Learning
Ghosh, Dibya
Bellemare, Marc G.
25TH AMERICAS CONFERENCE ON INFORMATION SYSTEMS (AMCIS 2019), 2019,