Provably Efficient Offline Reinforcement Learning With Trajectory-Wise Reward

Cited: 0
Authors
Xu, Tengyu [1 ,2 ]
Wang, Yue [3 ]
Zou, Shaofeng [4 ]
Liang, Yingbin [1 ]
Affiliations
[1] Ohio State Univ, Dept Elect & Comp Engn, Columbus, OH 43210 USA
[2] GenAI Meta AI Team, Meta Platforms, Menlo Pk, CA 94025 USA
[3] Univ Cent Florida, Dept Elect & Comp Engn, Orlando, FL 32816 USA
[4] Univ Buffalo, State Univ New York, Dept Elect Engn, Buffalo, NY 14228 USA
Funding
National Science Foundation (NSF), USA;
Keywords
Trajectory; Kernel; Standards; Optimization; Markov decision processes; Function approximation; Vectors; Linear Markov decision processes (MDPs); neural networks; reward redistribution; pessimistic principle
DOI
10.1109/TIT.2024.3427141
CLC Classification Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair. In many real-world applications, however, an agent can observe only a score that represents the quality of the whole trajectory, referred to as the trajectory-wise reward. In such a setting, it is difficult for standard RL methods to utilize the trajectory-wise reward effectively, and large bias and variance errors can be incurred in policy evaluation. In this work, we propose a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution and then performs pessimistic value iteration based on the learned proxy reward. To ensure that the value functions constructed by PARTED are always pessimistic with respect to the optimal ones, we design a new penalty term to offset the uncertainty of the proxy reward. We first show that PARTED achieves an O(dH^3/√N) suboptimality for linear MDPs, where d is the feature dimension, H is the episode length, and N is the size of the offline dataset. We further extend our algorithm and results to general large-scale episodic MDPs with neural network function approximation. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient for general MDPs with trajectory-wise rewards.
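The following is a minimal sketch, not the authors' implementation, of the two ingredients the abstract describes for the linear-MDP case: a least-squares reward redistribution that fits a per-step proxy reward linear in the feature phi(s, a) so that the summed proxy rewards match each observed trajectory return, and an elliptical uncertainty penalty built from the same regularized Gram matrix to keep the proxy reward pessimistic. The function names (redistribute_rewards, proxy_reward) and the constants lam and beta are illustrative assumptions; the paper's full PARTED penalty also accounts for uncertainty in the Bellman backups, which is omitted here.

```python
import numpy as np

def redistribute_rewards(trajs, lam=1.0):
    # trajs: list of (feats, R) pairs; feats is an (H, d) array of per-step
    # features phi(s_h, a_h), and R is the observed trajectory-wise return.
    # Ridge regression on the summed features so that
    #   sum_h phi(s_h, a_h)^T theta  ~=  R   for every trajectory.
    d = trajs[0][0].shape[1]
    A = lam * np.eye(d)              # regularized Gram matrix
    b = np.zeros(d)
    for feats, R in trajs:
        psi = feats.sum(axis=0)      # trajectory-level feature sum
        A += np.outer(psi, psi)
        b += psi * R
    theta = np.linalg.solve(A, b)
    return theta, A                  # A is reused to quantify uncertainty

def proxy_reward(phi, theta, A, beta=1.0):
    # Pessimistic per-step proxy reward: the least-squares estimate
    # phi^T theta minus an elliptical-confidence penalty
    # beta * sqrt(phi^T A^{-1} phi), so that downstream value iteration
    # stays below the true value with high probability.
    bonus = beta * np.sqrt(phi @ np.linalg.solve(A, phi))
    return float(phi @ theta - bonus)

# Tiny usage example with random data (H = 5 steps, d = 3 features).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    trajs = [(rng.normal(size=(5, 3)), rng.normal()) for _ in range(100)]
    theta, A = redistribute_rewards(trajs)
    print(proxy_reward(rng.normal(size=3), theta, A))
```

In this sketch, pessimistic value iteration would simply back up the penalized proxy reward in place of the unobserved per-step reward; the penalty shrinks as the offline dataset covers the relevant feature directions more densely.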
Pages: 6481 - 6518
Number of Pages: 38
Related Papers
50 records
  • [1] Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning
    Seo, Younggyo
    Lee, Kimin
    Clavera, Ignasi
    Kurutach, Thanard
    Shin, Jinwoo
    Abbeel, Pieter
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [2] Is Pessimism Provably Efficient for Offline Reinforcement Learning?
    Jin, Ying
    Yang, Zhuoran
    Wang, Zhaoran
    MATHEMATICS OF OPERATIONS RESEARCH, 2024,
  • [3] Provably Efficient Offline Multi-agent Reinforcement Learning via Strategy-wise Bonus
    Cui, Qiwen
    Du, Simon S.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [4] Provably Efficient Offline Reinforcement Learning in Regular Decision Processes
    Cipollone, Roberto
    Jonsson, Anders
    Ronca, Alessandro
    Talebi, Mohammad Sadegh
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [5] Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning
    Kong, Dingwen
    Yang, Lin F.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [6] Provably Efficient Offline Reinforcement Learning for Partially Observable Markov Decision Processes
    Guo, Hongyi
    Cai, Qi
    Zhang, Yufeng
    Yang, Zhuoran
    Wang, Zhaoran
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [7] On Efficient Sampling in Offline Reinforcement Learning
    Jia, Qing-Shan
    2024 14TH ASIAN CONTROL CONFERENCE, ASCC 2024, 2024, : 1 - 6
  • [8] Provably Safe Reinforcement Learning with Step-wise Violation Constraints
    Xiong, Nuoya
    Du, Yihan
    Huang, Longbo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [9] Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage
    Blanchet, Jose
    Lu, Miao
    Zhang, Tong
    Zhong, Han
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [10] Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning
    Feng, Fei
    Wang, Ruosong
    Yin, Wotao
    Du, Simon S.
    Yang, Lin F.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33