Off-Policy Evaluation via Off-Policy Classification

Cited by: 0
Authors
Irpan, Alex [1]
Rao, Kanishka [1]
Bousmalis, Konstantinos [2]
Harris, Chris [1]
Ibarz, Julian [1]
Levine, Sergey [1,3]
Affiliations
[1] Google Brain, Mountain View, CA 94043 USA
[2] DeepMind, London, England
[3] Univ Calif Berkeley, Berkeley, CA 94720 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019) | 2019 / Vol. 32
Keywords
ARCADE LEARNING-ENVIRONMENT;
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE for value-based methods, which are of particular interest in deep RL, with applications like robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Existing OPE metrics either rely on a model of the environment or use importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations, such as images, models of the environment can be difficult to fit and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real world of policies trained in simulation for an image-based robotic manipulation task.
Pages: 12
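
To make the proposed metric concrete, here is a minimal sketch of a SoftOPC-style score, assuming logged (state, action) pairs in which pairs drawn from successful episodes are marked as positives; the function name, the exact weighting, and the success-labeling scheme are illustrative assumptions, not the paper's precise definition.

```python
import numpy as np

def soft_opc_style_score(q_values, is_positive):
    """Illustrative SoftOPC-style score (assumed form, not the paper's exact
    definition): mean Q-value on 'positive' (success-labeled) state-action
    pairs minus the mean Q-value over all logged pairs. Higher is better.

    q_values    : array of Q(s, a) for every logged (s, a) pair
    is_positive : boolean array, True where the pair came from a
                  successful episode (the PU-learning 'positive' class)
    """
    q_values = np.asarray(q_values, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    positive_mean = q_values[is_positive].mean()   # score on known positives
    overall_mean = q_values.mean()                 # score on all (unlabeled) data
    return positive_mean - overall_mean

# Hypothetical usage: rank candidate Q-functions for model selection
# without rolling them out in the real environment.
# scores = {name: soft_opc_style_score(q_of[name], success_labels)
#           for name in candidate_policies}
```

Under this framing, a Q-function that ranks success-producing actions above the rest of the logged data receives a higher score, which is the intuition behind using such a classification-based metric to compare checkpoints or hyperparameter settings off-policy.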
Related Papers
50 results in total
  • [1] Off-Policy Evaluation via the Regularized Lagrangian
    Yang, Mengjiao
    Nachum, Ofir
    Dai, Bo
    Li, Lihong
    Schuurmans, Dale
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [2] Universal Off-Policy Evaluation
    Chandak, Yash
    Niekum, Scott
    da Silva, Bruno Castro
    Learned-Miller, Erik
    Brunskill, Emma
    Thomas, Philip S.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [3] Off-Policy Evaluation for Human Feedback
    Gao, Qitong
    Gao, Ge
    Dong, Juncheng
    Tarokh, Vahid
    Chi, Min
    Pajic, Miroslav
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [4] Off-policy evaluation for slate recommendation
    Swaminathan, Adith
    Krishnamurthy, Akshay
    Agarwal, Alekh
    Dudik, Miroslav
    Langford, John
    Jose, Damien
    Zitouni, Imed
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [5] High Confidence Off-Policy Evaluation
    Thomas, Philip S.
    Theocharous, Georgios
    Ghavamzadeh, Mohammad
    PROCEEDINGS OF THE TWENTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2015, : 3000 - 3006
  • [6] State Relevance for Off-Policy Evaluation
    Shen, Simon P.
    Ma, Yecheng Jason
    Gottesman, Omer
    Doshi-Velez, Finale
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [7] Evaluating the Robustness of Off-Policy Evaluation
    Saito, Yuta
    Udagawa, Takuma
    Kiyohara, Haruka
    Mogi, Kazuki
    Narita, Yusuke
    Tateno, Kei
    15TH ACM CONFERENCE ON RECOMMENDER SYSTEMS (RECSYS 2021), 2021, : 114 - 123
  • [8] Off-Policy Proximal Policy Optimization
    Meng, Wenjia
    Zheng, Qian
    Pan, Gang
    Yin, Yilong
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9162 - 9170
  • [9] A Nonparametric Off-Policy Policy Gradient
    Tosatto, Samuele
    Carvalho, Joao
    Abdulsamad, Hany
    Peters, Jan
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108, 2020, 108
  • [10] Representation Balancing MDPs for Off-Policy Policy Evaluation
    Liu, Yao
    Gottesman, Omer
    Raghu, Aniruddh
    Komorowski, Matthieu
    Faisal, Aldo
    Doshi-Velez, Finale
    Brunskill, Emma
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31