Provably Efficient Offline Reinforcement Learning With Trajectory-Wise Reward

Cited by: 0
Authors
Xu, Tengyu [1 ,2 ]
Wang, Yue [3 ]
Zou, Shaofeng [4 ]
Liang, Yingbin [1 ]
Affiliations
[1] Ohio State Univ, Dept Elect & Comp Engn, Columbus, OH 43210 USA
[2] GenAI Meta AI Team, Meta Platforms, Menlo Pk, CA 94025 USA
[3] Univ Cent Florida, Dept Elect & Comp Engn, Orlando, FL 32816 USA
[4] Univ Buffalo, State Univ New York, Dept Elect Engn, Buffalo, NY 14228 USA
Funding
U.S. National Science Foundation
Keywords
Trajectory; Kernel; Standards; Optimization; Markov decision processes; Function approximation; Vectors; Linear Markov decision processes (MDPs); neural networks; function approximation; reward redistribution; pessimistic principle;
DOI
10.1109/TIT.2024.3427141
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair. In many real-world applications, however, an agent can observe only a score that represents the quality of the whole trajectory, referred to as the trajectory-wise reward. In such a setting, it is difficult for standard RL methods to make effective use of the trajectory-wise reward, and large bias and variance errors can be incurred in policy evaluation. In this work, we propose a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards. To ensure that the value functions constructed by PARTED are always pessimistic with respect to the optimal ones, we design a new penalty term to offset the uncertainty of the proxy reward. We first show that PARTED achieves an $O(dH^3/\sqrt{N})$ suboptimality for linear MDPs, where $d$ is the dimension of the feature, $H$ is the episode length, and $N$ is the size of the offline dataset. We further extend our algorithm and results to general large-scale episodic MDPs with neural network function approximation. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient for general MDPs with trajectory-wise reward.
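To illustrate the two ingredients the abstract describes for the linear-MDP case, the following is a minimal sketch, not the authors' implementation: a least-squares reward redistribution that fits per-step proxy rewards to observed trajectory returns, and an elliptical uncertainty penalty of the kind used to keep the constructed value functions pessimistic. The array shapes, the ridge parameter `lam`, and the penalty coefficient `beta` are illustrative assumptions.

```python
import numpy as np

def redistribute_rewards(phi, returns, lam=1.0):
    """Least-squares reward redistribution (sketch).

    phi:     (N, H, d) per-step features of N offline trajectories
    returns: (N,) observed trajectory-wise rewards
    Fits theta so that sum_h <theta, phi[i, h]> matches each trajectory
    return, then emits per-step proxy rewards <theta, phi(s, a)>.
    """
    N, H, d = phi.shape
    psi = phi.sum(axis=1)                  # (N, d): summed features per trajectory
    gram = psi.T @ psi + lam * np.eye(d)   # regularized Gram matrix Lambda
    theta = np.linalg.solve(gram, psi.T @ returns)
    proxy_rewards = phi @ theta            # (N, H): learned per-step proxy rewards
    return theta, proxy_rewards, gram

def uncertainty_penalty(phi_sa, gram, beta=1.0):
    """Elliptical penalty beta * ||phi(s, a)||_{Lambda^{-1}}, subtracted in
    pessimistic value iteration to offset proxy-reward uncertainty."""
    return beta * np.sqrt(phi_sa @ np.linalg.solve(gram, phi_sa))
```

The key design choice the sketch reflects is that regression is performed on the summed trajectory features rather than per-step targets, since only the trajectory-level return is observed; the per-step proxy rewards then fall out of the fitted linear model.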
Pages: 6481-6518
Page count: 38