Universal Off-Policy Evaluation

Times Cited: 0
Authors
Chandak, Yash [1 ]
Niekum, Scott [2 ]
da Silva, Bruno Castro [1 ]
Learned-Miller, Erik [1 ]
Brunskill, Emma [3 ]
Thomas, Philip S. [1 ]
Affiliations
[1] Univ Massachusetts, Amherst, MA 01003 USA
[2] Univ Texas Austin, Austin, TX 78712 USA
[3] Stanford Univ, Stanford, CA 94305 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021) | 2021
Funding
U.S. National Science Foundation;
Keywords
MARKOV DECISION-PROCESSES; WILD BOOTSTRAP; VARIANCE; INFERENCE; BOUNDS;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a universal off-policy estimator (UnO): one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO's applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts.
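The central idea described in the abstract is to estimate the full cumulative distribution of returns under an evaluation policy from trajectories collected by a behavior policy, and then read any distributional parameter (mean, variance, quantiles, CVaR) off that single estimate. The sketch below is a minimal, hypothetical illustration of that idea using per-trajectory importance weighting; the array names, the synthetic data, and the specific plug-in formulas are assumptions for illustration, not the authors' implementation, and the sketch omits UnO's high-confidence bounds.

```python
import numpy as np

def cdf_based_estimates(returns, iw, alpha=0.25):
    """Sketch of CDF-based off-policy evaluation (assumed interface):
    build an importance-weighted empirical CDF of returns, then read
    several distributional parameters off that one estimate.

    returns: per-trajectory returns G_i observed under the behavior policy
    iw:      per-trajectory importance ratios prod_t pi_e(a_t|s_t)/pi_b(a_t|s_t)
    alpha:   tail level used for the quantile and CVaR examples
    """
    order = np.argsort(returns)
    g = returns[order]
    w = iw[order] / len(returns)            # weight each trajectory by rho_i / n
    cdf = np.clip(np.cumsum(w), 0.0, 1.0)   # weighted empirical CDF at each g_i

    # Mean of the return distribution (reduces to ordinary importance sampling).
    mean = np.sum(w * g)
    # Plug-in variance of the return distribution.
    variance = np.sum(w * (g - mean) ** 2)
    # alpha-quantile: first return value where the estimated CDF reaches alpha.
    q_idx = min(np.searchsorted(cdf, alpha), len(g) - 1)
    quantile = g[q_idx]
    # CVaR_alpha: self-normalized weighted average of the lower alpha tail.
    tail = g <= quantile
    cvar = np.sum(w[tail] * g[tail]) / max(np.sum(w[tail]), 1e-12)
    return {"mean": mean, "variance": variance,
            "quantile": quantile, "cvar": cvar}

# Hypothetical usage with synthetic placeholder data:
rng = np.random.default_rng(0)
returns = rng.normal(loc=1.0, scale=0.5, size=1000)  # placeholder returns
iw = rng.lognormal(mean=0.0, sigma=0.3, size=1000)   # placeholder ratios
print(cdf_based_estimates(returns, iw))
```

All parameters here come from the same weighted CDF estimate, which is what allows them to be estimated (and, in the paper, bounded) simultaneously rather than by separate, parameter-specific estimators.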
Pages: 16