Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes

Cited by: 7
Authors
Bennett, Andrew [1]
Kallus, Nathan [1]
Affiliations
[1] Cornell Univ, Cornell Tech, New York, NY 10044 USA
Funding
U.S. National Science Foundation
Keywords
offline reinforcement learning; unmeasured confounding; semiparametric efficiency
DOI
10.1287/opre.2021.0781
Chinese Library Classification
C93 [Management]
Subject Classification Codes
12; 1201; 1202; 120202
Abstract
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived under the assumption of a perfect Markov decision process (MDP) model. Here we tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, we consider estimating the value of a given target policy in an unknown POMDP, given trajectories with only partial state observations that were generated by a different and unknown policy which may depend on the unobserved state. We address two questions: what conditions allow us to identify the target policy value from the observed data and, given identification, how best to estimate it. To answer these, we extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible by the existence of so-called bridge functions. We term the resulting framework proximal reinforcement learning (PRL). We then show how to construct estimators in these settings and prove they are semiparametrically efficient. We demonstrate the benefits of PRL in an extensive simulation study and on the problem of sepsis management.
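As an illustrative sketch only, written in the notation of standard static proximal causal inference rather than the paper's own sequential notation: identification there rests on an outcome bridge function h, defined with respect to treatment A, outcome Y, covariates X, and negative-control proxies Z and W, solving a conditional moment restriction,
\[
  \mathbb{E}\bigl[\, Y - h(W, A, X) \,\bigm|\, Z, A, X \,\bigr] = 0,
  \qquad\text{which then yields}\qquad
  \mathbb{E}\bigl[\, h(W, a, X) \,\bigr] = \mathbb{E}\bigl[\, Y(a) \,\bigr].
\]
PRL extends this idea to trajectories, positing time-indexed bridge functions whose existence (together with completeness-type conditions) identifies the target policy value from confounded, partially observed data.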
Pages: 1071-1086 (16 pages)