Off-Policy Evaluation via Off-Policy Classification

Cited by: 0
Authors
Irpan, Alex [1]
Rao, Kanishka [1]
Bousmalis, Konstantinos [2]
Harris, Chris [1]
Ibarz, Julian [1]
Levine, Sergey [1,3]
Affiliations
[1] Google Brain, Mountain View, CA 94043 USA
[2] DeepMind, London, England
[3] Univ Calif Berkeley, Berkeley, CA 94720 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019) | 2019 / Vol. 32
Keywords
ARCADE LEARNING-ENVIRONMENT;
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE for value-based methods, which are of particular interest in deep RL, with applications like robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Existing OPE metrics either rely on a model of the environment or use importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations, such as images, models of the environment can be difficult to fit and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real world of policies trained in simulation for an image-based robotic manipulation task.
Pages: 12
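
To make the proposed metric concrete, here is a minimal sketch of a SoftOPC-style score, assuming logged (state, action) pairs in which pairs drawn from successful episodes are marked as positives; the function name, the exact weighting, and the success-labeling scheme are illustrative assumptions, not the paper's precise definition.

```python
import numpy as np

def soft_opc_style_score(q_values, is_positive):
    """Illustrative SoftOPC-style score (assumed form, not the paper's exact
    definition): mean Q-value on 'positive' (success-labeled) state-action
    pairs minus the mean Q-value over all logged pairs. Higher is better.

    q_values    : array of Q(s, a) for every logged (s, a) pair
    is_positive : boolean array, True where the pair came from a
                  successful episode (the PU-learning 'positive' class)
    """
    q_values = np.asarray(q_values, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    positive_mean = q_values[is_positive].mean()   # score on known positives
    overall_mean = q_values.mean()                 # score on all (unlabeled) data
    return positive_mean - overall_mean

# Hypothetical usage: rank candidate Q-functions for model selection
# without rolling them out in the real environment.
# scores = {name: soft_opc_style_score(q_of[name], success_labels)
#           for name in candidate_policies}
```

Under this framing, a Q-function that ranks success-producing actions above the rest of the logged data receives a higher score, which is the intuition behind using such a classification-based metric to compare checkpoints or hyperparameter settings off-policy.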
Related Papers
50 results in total
  • [1] Off-Policy Evaluation via the Regularized Lagrangian
    Yang, Mengjiao
    Nachum, Ofir
    Dai, Bo
    Li, Lihong
    Schuurmans, Dale
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [2] Universal Off-Policy Evaluation
    Chandak, Yash
    Niekum, Scott
    da Silva, Bruno Castro
    Learned-Miller, Erik
    Brunskill, Emma
    Thomas, Philip S.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [3] Off-Policy Evaluation for Human Feedback
    Gao, Qitong
    Gao, Ge
    Dong, Juncheng
    Tarokh, Vahid
    Chi, Min
    Pajic, Miroslav
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [4] Off-policy evaluation for slate recommendation
    Swaminathan, Adith
    Krishnamurthy, Akshay
    Agarwal, Alekh
    Dudik, Miroslav
    Langford, John
    Jose, Damien
    Zitouni, Imed
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [5] High Confidence Off-Policy Evaluation
    Thomas, Philip S.
    Theocharous, Georgios
    Ghavamzadeh, Mohammad
    PROCEEDINGS OF THE TWENTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2015, : 3000 - 3006
  • [6] State Relevance for Off-Policy Evaluation
    Shen, Simon P.
    Ma, Yecheng Jason
    Gottesman, Omer
    Doshi-Velez, Finale
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [7] Evaluating the Robustness of Off-Policy Evaluation
    Saito, Yuta
    Udagawa, Takuma
    Kiyohara, Haruka
    Mogi, Kazuki
    Narita, Yusuke
    Tateno, Kei
    15TH ACM CONFERENCE ON RECOMMENDER SYSTEMS (RECSYS 2021), 2021, : 114 - 123
  • [8] Off-Policy Proximal Policy Optimization
    Meng, Wenjia
    Zheng, Qian
    Pan, Gang
    Yin, Yilong
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9162 - 9170
  • [9] A Nonparametric Off-Policy Policy Gradient
    Tosatto, Samuele
    Carvalho, Joao
    Abdulsamad, Hany
    Peters, Jan
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108, 2020, 108
  • [10] Representation Balancing MDPs for Off-Policy Policy Evaluation
    Liu, Yao
    Gottesman, Omer
    Raghu, Aniruddh
    Komorowski, Matthieu
    Faisal, Aldo
    Doshi-Velez, Finale
    Brunskill, Emma
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31