Provably Efficient Offline Reinforcement Learning With Trajectory-Wise Reward

Cited: 0
Authors
Xu, Tengyu [1 ,2 ]
Wang, Yue [3 ]
Zou, Shaofeng [4 ]
Liang, Yingbin [1 ]
Affiliations
[1] Ohio State Univ, Dept Elect & Comp Engn, Columbus, OH 43210 USA
[2] GenAI Meta AI Team, Meta Platforms, Menlo Pk, CA 94025 USA
[3] Univ Cent Florida, Dept Elect & Comp Engn, Orlando, FL 32816 USA
[4] Univ Buffalo, State Univ New York, Dept Elect Engn, Buffalo, NY 14228 USA
Funding
National Science Foundation (NSF), USA;
Keywords
Trajectory; Kernel; Standards; Optimization; Markov decision processes; Function approximation; Vectors; Linear Markov decision processes (MDPs); neural networks; reward redistribution; pessimistic principle
DOI
10.1109/TIT.2024.3427141
CLC Classification Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair. In many real-world applications, however, an agent can observe only a score that represents the quality of the whole trajectory, referred to as the trajectory-wise reward. In such a setting, it is difficult for standard RL methods to utilize the trajectory-wise reward effectively, and large bias and variance errors can be incurred in policy evaluation. In this work, we propose a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution and then performs pessimistic value iteration based on the learned proxy reward. To ensure that the value functions constructed by PARTED are always pessimistic with respect to the optimal ones, we design a new penalty term to offset the uncertainty of the proxy reward. We first show that PARTED achieves an O(dH^3/√N) suboptimality for linear MDPs, where d is the feature dimension, H is the episode length, and N is the size of the offline dataset. We further extend our algorithm and results to general large-scale episodic MDPs with neural network function approximation. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient for general MDPs with trajectory-wise rewards.
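The following is a minimal sketch, not the authors' implementation, of the two ingredients the abstract describes for the linear-MDP case: a least-squares reward redistribution that fits a per-step proxy reward linear in the feature phi(s, a) so that the summed proxy rewards match each observed trajectory return, and an elliptical uncertainty penalty built from the same regularized Gram matrix to keep the proxy reward pessimistic. The function names (redistribute_rewards, proxy_reward) and the constants lam and beta are illustrative assumptions; the paper's full PARTED penalty also accounts for uncertainty in the Bellman backups, which is omitted here.

```python
import numpy as np

def redistribute_rewards(trajs, lam=1.0):
    # trajs: list of (feats, R) pairs; feats is an (H, d) array of per-step
    # features phi(s_h, a_h), and R is the observed trajectory-wise return.
    # Ridge regression on the summed features so that
    #   sum_h phi(s_h, a_h)^T theta  ~=  R   for every trajectory.
    d = trajs[0][0].shape[1]
    A = lam * np.eye(d)              # regularized Gram matrix
    b = np.zeros(d)
    for feats, R in trajs:
        psi = feats.sum(axis=0)      # trajectory-level feature sum
        A += np.outer(psi, psi)
        b += psi * R
    theta = np.linalg.solve(A, b)
    return theta, A                  # A is reused to quantify uncertainty

def proxy_reward(phi, theta, A, beta=1.0):
    # Pessimistic per-step proxy reward: the least-squares estimate
    # phi^T theta minus an elliptical-confidence penalty
    # beta * sqrt(phi^T A^{-1} phi), so that downstream value iteration
    # stays below the true value with high probability.
    bonus = beta * np.sqrt(phi @ np.linalg.solve(A, phi))
    return float(phi @ theta - bonus)

# Tiny usage example with random data (H = 5 steps, d = 3 features).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    trajs = [(rng.normal(size=(5, 3)), rng.normal()) for _ in range(100)]
    theta, A = redistribute_rewards(trajs)
    print(proxy_reward(rng.normal(size=3), theta, A))
```

In this sketch, pessimistic value iteration would simply back up the penalized proxy reward in place of the unobserved per-step reward; the penalty shrinks as the offline dataset covers the relevant feature directions more densely.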
Pages: 6481 - 6518
Number of Pages: 38
Related Papers
50 records
  • [1] Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning
    Seo, Younggyo
    Lee, Kimin
    Clavera, Ignasi
    Kurutach, Thanard
    Shin, Jinwoo
    Abbeel, Pieter
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [2] Is Pessimism Provably Efficient for Offline Reinforcement Learning?
    Jin, Ying
    Yang, Zhuoran
    Wang, Zhaoran
    MATHEMATICS OF OPERATIONS RESEARCH, 2024,
  • [3] Provably Efficient Offline Multi-agent Reinforcement Learning via Strategy-wise Bonus
    Cui, Qiwen
    Du, Simon S.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [4] Provably Efficient Offline Reinforcement Learning in Regular Decision Processes
    Cipollone, Roberto
    Jonsson, Anders
    Ronca, Alessandro
    Talebi, Mohammad Sadegh
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [5] Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning
    Kong, Dingwen
    Yang, Lin F.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [6] Provably Efficient Offline Reinforcement Learning for Partially Observable Markov Decision Processes
    Guo, Hongyi
    Cai, Qi
    Zhang, Yufeng
    Yang, Zhuoran
    Wang, Zhaoran
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [7] On Efficient Sampling in Offline Reinforcement Learning
    Jia, Qing-Shan
    2024 14TH ASIAN CONTROL CONFERENCE, ASCC 2024, 2024, : 1 - 6
  • [8] Provably Safe Reinforcement Learning with Step-wise Violation Constraints
    Xiong, Nuoya
    Du, Yihan
    Huang, Longbo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [9] Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage
    Blanchet, Jose
    Lu, Miao
    Zhang, Tong
    Zhong, Han
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [10] Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning
    Feng, Fei
    Wang, Ruosong
    Yin, Wotao
    Du, Simon S.
    Yang, Lin F.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33