Off-Policy Reinforcement Learning with Delayed Rewards

Cited by: 0
Authors
Han, Beining [1 ]
Ren, Zhizhou [2 ,3 ]
Wu, Zuofan [3 ]
Zhou, Yuan [4 ]
Peng, Jian [2 ,3 ,5 ]
Affiliations
[1] Tsinghua Univ, Inst Interdisciplinary Informat Sci, Beijing, Peoples R China
[2] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
[3] Helixon Ltd, Beijing, Peoples R China
[4] Tsinghua Univ, Yau Math Sci Ctr, Beijing, Peoples R China
[5] Tsinghua Univ, Inst Ind AI Res, Beijing, Peoples R China
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We study deep reinforcement learning (RL) algorithms with delayed rewards. In many real-world tasks, instant rewards are often not readily accessible or even defined immediately after the agent performs actions. In this work, we first formally define the environment with delayed rewards and discuss the challenges raised due to the non-Markovian nature of such environments. Then, we introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees. For practical tasks with high dimensional state spaces, we further introduce the HC-decomposition rule of the Q-function in our framework which naturally leads to an approximation scheme that helps boost the training efficiency and stability. We finally conduct extensive experiments to demonstrate the superior performance of our algorithms over the existing work and their variants.
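The delayed-reward setting the abstract describes can be illustrated with a minimal sketch. This is not the authors' algorithm; `DelayedRewardWrapper` and `delay` are hypothetical names introduced here to show how withholding per-step rewards and revealing only their sum at the end of each delay window makes the observed process non-Markovian: the revealed reward depends on the whole window of past transitions, not only on the current state and action.

```python
# Hypothetical sketch of a delayed-reward environment wrapper.
# Per-step rewards are accumulated in a buffer and only their sum is
# revealed every `delay` steps (or at episode termination).
class DelayedRewardWrapper:
    """Wraps a step-based environment so rewards arrive in delayed batches."""

    def __init__(self, env, delay=5):
        self.env = env
        self.delay = delay      # steps per delay window (an assumption here)
        self._buffer = 0.0      # accumulated, not-yet-revealed reward
        self._t = 0             # steps taken since the last reset

    def reset(self):
        self._buffer = 0.0
        self._t = 0
        return self.env.reset()

    def step(self, action):
        state, reward, done = self.env.step(action)
        self._buffer += reward
        self._t += 1
        if self._t % self.delay == 0 or done:
            # End of a delay window: flush the accumulated reward.
            revealed, self._buffer = self._buffer, 0.0
        else:
            # Inside a window: the instant reward is unobservable.
            revealed = 0.0
        return state, revealed, done
```

For example, with `delay=3` and a base environment that emits reward 1 at every step, the agent observes rewards 0, 0, 3, 0, 0, 3, …, so a standard Markovian Q-function over single transitions no longer captures the reward signal.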
Pages: 24
Related Papers
50 items in total
  • [1] Safe and efficient off-policy reinforcement learning
    Munos, Remi
    Stepleton, Thomas
    Harutyunyan, Anna
    Bellemare, Marc G.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), 2016, 29
  • [2] Bounds for Off-policy Prediction in Reinforcement Learning
    Joseph, Ajin George
    Bhatnagar, Shalabh
    2017 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2017, : 3991 - 3997
  • [3] Off-Policy Reinforcement Learning with Gaussian Processes
    Chowdhary, Girish
    Liu, Miao
    Grande, Robert
    Walsh, Thomas
    How, Jonathan
    Carin, Lawrence
    IEEE/CAA JOURNAL OF AUTOMATICA SINICA, 2014, 1 (03): 227 - 238
  • [4] Representations for Stable Off-Policy Reinforcement Learning
    Ghosh, Dibya
    Bellemare, Marc G.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 119, 2020, 119
  • [5] A perspective on off-policy evaluation in reinforcement learning
    Li, Lihong
    FRONTIERS OF COMPUTER SCIENCE, 2019, 13 (05) : 911 - 912
  • [6] On the Reuse Bias in Off-Policy Reinforcement Learning
    Ying, Chengyang
    Hao, Zhongkai
    Zhou, Xinning
    Su, Hang
    Yan, Dong
    Zhu, Jun
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 4513 - 4521
  • [8] Reliable Off-Policy Evaluation for Reinforcement Learning
    Wang, Jie
    Gao, Rui
    Zha, Hongyuan
    OPERATIONS RESEARCH, 2024, 72 (02) : 699 - 716
  • [9] Sequential Search with Off-Policy Reinforcement Learning
    Miao, Dadong
    Wang, Yanan
    Tang, Guoyu
    Liu, Lin
    Xu, Sulong
    Long, Bo
    Xiao, Yun
    Wu, Lingfei
    Jiang, Yunjiang
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 4006 - 4015