Provably Efficient Offline Reinforcement Learning With Trajectory-Wise Reward

Cited by: 0
Authors
Xu, Tengyu [1 ,2 ]
Wang, Yue [3 ]
Zou, Shaofeng [4 ]
Liang, Yingbin [1 ]
Affiliations
[1] Ohio State Univ, Dept Elect & Comp Engn, Columbus, OH 43210 USA
[2] GenAI Meta AI Team, Meta Platforms, Menlo Pk, CA 94025 USA
[3] Univ Cent Florida, Dept Elect & Comp Engn, Orlando, FL 32816 USA
[4] Univ Buffalo, State Univ New York, Dept Elect Engn, Buffalo, NY 14228 USA
Funding
U.S. National Science Foundation
Keywords
Trajectory; Kernel; Standards; Optimization; Markov decision processes; Function approximation; Vectors; Linear Markov decision processes (MDPs); neural networks; function approximation; reward redistribution; pessimistic principle;
DOI
10.1109/TIT.2024.3427141
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair. In many real-world applications, however, an agent can observe only a score that represents the quality of the whole trajectory, referred to as the trajectory-wise reward. In such a setting, it is difficult for standard RL methods to make effective use of the trajectory-wise reward, and large bias and variance errors can be incurred in policy evaluation. In this work, we propose a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards. To ensure that the value functions constructed by PARTED are always pessimistic with respect to the optimal ones, we design a new penalty term to offset the uncertainty of the proxy reward. We first show that PARTED achieves an $O(dH^3/\sqrt{N})$ suboptimality for linear MDPs, where $d$ is the dimension of the feature, $H$ is the episode length, and $N$ is the size of the offline dataset. We further extend our algorithm and results to general large-scale episodic MDPs with neural network function approximation. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient for general MDPs with trajectory-wise reward.
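To illustrate the two ingredients the abstract describes for the linear-MDP case, the following is a minimal sketch, not the authors' implementation: a least-squares reward redistribution that fits per-step proxy rewards to observed trajectory returns, and an elliptical uncertainty penalty of the kind used to keep the constructed value functions pessimistic. The array shapes, the ridge parameter `lam`, and the penalty coefficient `beta` are illustrative assumptions.

```python
import numpy as np

def redistribute_rewards(phi, returns, lam=1.0):
    """Least-squares reward redistribution (sketch).

    phi:     (N, H, d) per-step features of N offline trajectories
    returns: (N,) observed trajectory-wise rewards
    Fits theta so that sum_h <theta, phi[i, h]> matches each trajectory
    return, then emits per-step proxy rewards <theta, phi(s, a)>.
    """
    N, H, d = phi.shape
    psi = phi.sum(axis=1)                  # (N, d): summed features per trajectory
    gram = psi.T @ psi + lam * np.eye(d)   # regularized Gram matrix Lambda
    theta = np.linalg.solve(gram, psi.T @ returns)
    proxy_rewards = phi @ theta            # (N, H): learned per-step proxy rewards
    return theta, proxy_rewards, gram

def uncertainty_penalty(phi_sa, gram, beta=1.0):
    """Elliptical penalty beta * ||phi(s, a)||_{Lambda^{-1}}, subtracted in
    pessimistic value iteration to offset proxy-reward uncertainty."""
    return beta * np.sqrt(phi_sa @ np.linalg.solve(gram, phi_sa))
```

The key design choice the sketch reflects is that regression is performed on the summed trajectory features rather than per-step targets, since only the trajectory-level return is observed; the per-step proxy rewards then fall out of the fitted linear model.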
Pages: 6481-6518
Page count: 38