PROJECTED STATE-ACTION BALANCING WEIGHTS FOR OFFLINE REINFORCEMENT LEARNING

Cited by: 1
Authors
Wang, Jiayi [1 ]
Qi, Zhengling [2 ]
Wong, Raymond K. W. [3 ]
Affiliations
[1] Univ Texas Dallas, Dept Math Sci, Richardson, TX 75083 USA
[2] George Washington Univ, Dept Decis Sci, Washington, DC 20052 USA
[3] Texas A&M Univ, Dept Stat, College Stn, TX 77843 USA
Funding
National Science Foundation (USA)
Keywords
Infinite horizons; Markov decision process; Policy evaluation; Reinforcement learning; Dynamic treatment regimes; Rates; Convergence; Inference
DOI
10.1214/23-AOS2302
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject Classification Codes
020208; 070103; 0714
Abstract
Off-policy evaluation is a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is asymptotically normal under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points per trajectory. Consequently, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for the well-posedness of the operator related to nonparametric Q-function estimation in the off-policy setting, which characterizes the difficulty of Q-function estimation and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
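For intuition only, the sketch below shows how marginal importance sampling weights are typically used to form an off-policy value estimate once they have been fitted. It is not the authors' implementation: the weight function w_hat, the pooled data arrays, the self-normalization step, and the discount factor are all illustrative assumptions, and the paper's actual contribution, estimating the weights by approximately balancing state-action functions, is not reproduced here.

    # Conceptual sketch (not the paper's method) of a marginal importance
    # sampling value estimate. Assumes pooled transitions (s, a, r) drawn
    # from the behavior data and a fitted weight function w_hat(s, a) that
    # approximates the ratio of the target policy's discounted state-action
    # visitation density to the behavior data distribution.
    import numpy as np

    def mis_value_estimate(states, actions, rewards, w_hat, gamma=0.99):
        # Evaluate the fitted weights at each observed state-action pair.
        w = np.array([w_hat(s, a) for s, a in zip(states, actions)])
        w = w / w.mean()  # self-normalize; a common stabilizing choice
        # Weighted average reward, rescaled to the discounted-return scale:
        # V(pi) = E[sum_t gamma^t R_t] = E_{d_pi}[R] / (1 - gamma).
        return float((w * np.asarray(rewards)).mean() / (1.0 - gamma))

Roughly, the balancing idea requires the fitted weights to satisfy the empirical analogue of E[w(S, A)(f(S, A) - gamma * f(S', pi))] = (1 - gamma) * E[f(S_0, pi)] over a class of test functions f, where f(s, pi) averages f(s, a') over a' ~ pi(. | s); the sketch above only shows how such weights are used after fitting.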
Pages: 1639-1665 (27 pages)
Related Papers (50 total)
  • [41] Reinforcement Learning with Restrictions on the Action Set
    Bravo, Mario
    Faure, Mathieu
    SIAM JOURNAL ON CONTROL AND OPTIMIZATION, 2015, 53 (01) : 287 - 312
  • [42] Experiments with reinforcement learning in problems with continuous state and action spaces
    Santamaria, JC
    Sutton, RS
    Ram, A
    ADAPTIVE BEHAVIOR, 1997, 6 (02) : 163 - 217
  • [43] Learning Personalized Health Recommendations via Offline Reinforcement Learning
    Preuett, Larry
    PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024, : 1355 - 1357
  • [44] Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism
    Rashidinejad, Paria
    Zhu, Banghua
    Ma, Cong
    Jiao, Jiantao
    Russell, Stuart
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2022, 68 (12) : 8156 - 8196
  • [45] Learning a Model-Free Robotic Continuous State-Action Task through Contractive Q-Network
    Davari, Mohammadjavad
    Alipour, Khalil
    Hadi, Alireza
    Tarvirdizadeh, Bahram
    2017 ARTIFICIAL INTELLIGENCE AND ROBOTICS (IRANOPEN), 2017, : 115 - 120
  • [46] Doubly constrained offline reinforcement learning for learning path recommendation
    Yun, Yue
    Dai, Huan
    An, Rui
    Zhang, Yupei
    Shang, Xuequn
    KNOWLEDGE-BASED SYSTEMS, 2024, 284
  • [47] Low-rank State-action Value-function Approximation
    Rozada, Sergio
    Tenorio, Victor
    Marques, Antonio G.
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 1471 - 1475
  • [48] Estimation of the Change of Agents Behavior Strategy Using State-Action History
    Uchida, Shihori
    Oba, Sigeyuki
    Ishii, Shin
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, PT II, 2017, 10614 : 100 - 107
  • [49] Compressive Features in Offline Reinforcement Learning for Recommender Systems
    Minh Pham
    Hung Nguyen
    Long Dang
    Nieves, Jennifer Adorno
    2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 5719 - 5726
  • [50] Integrating Offline Reinforcement Learning with Transformers for Sequential Recommendation
    Xi, Xumei
    Zhao, Yuke
    Liu, Quan
    Ouyang, Liwen
    Wu, Yang
    PROCEEDINGS OF THE 17TH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2023, 2023, : 1103 - 1108