Model-Based Offline Policy Optimization with Distribution Correcting Regularization

Cited by: 1
Authors
Shen, Jian [1 ]
Chen, Mingcheng [1 ]
Zhang, Zhicheng [1 ]
Yang, Zhengyu [1 ]
Zhang, Weinan [1 ]
Yu, Yong [1 ]
Affiliations
[1] Shanghai Jiao Tong University, Shanghai, People's Republic of China
Source
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES | 2021 / Vol. 12975
Funding
National Natural Science Foundation of China;
Keywords
Offline Reinforcement Learning; Model-based Reinforcement Learning; Occupancy measure;
DOI
10.1007/978-3-030-86486-6_11
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline classification code
081104; 0812; 0835; 1405
Abstract
Offline Reinforcement Learning (RL) aims at learning effective policies by leveraging previously collected datasets without further exploration in the environment. Model-based algorithms, which first learn a dynamics model from the offline dataset and then conservatively learn a policy under that model, have demonstrated great potential in offline RL. Previous model-based algorithms typically penalize the rewards with the uncertainty of the dynamics model, which, however, is not necessarily consistent with the model error. Inspired by a lower bound on the return under the real dynamics, in this paper we present a model-based alternative called DROP for offline RL. In particular, DROP estimates the density ratio between the model-rollout distribution and the offline data distribution via the DICE framework [45], and then regularizes the model-predicted rewards with this ratio for pessimistic policy learning. Extensive experiments show that DROP achieves performance comparable to or better than baselines on widely studied offline RL benchmarks.
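For illustration, the regularization idea described in the abstract (penalizing model-predicted rewards by how far model rollouts drift from the offline data) can be sketched as below. This is a minimal sketch, not the paper's implementation: it assumes the density ratios have already been produced by a DICE-style estimator, and the log-ratio penalty form, the coefficient lam, and the function name are illustrative assumptions.

    import numpy as np

    def pessimistic_rewards(model_rewards, density_ratios, lam=1.0, eps=1e-6):
        """Regularize model-predicted rewards with an estimated density ratio.

        model_rewards  : rewards predicted by the learned dynamics model for
                         model-rollout transitions (s, a).
        density_ratios : estimated d_rollout(s, a) / d_offline(s, a), e.g. from
                         a DICE-style estimator (assumed to be given here).
        lam            : penalty coefficient (illustrative choice).
        """
        model_rewards = np.asarray(model_rewards, dtype=float)
        density_ratios = np.asarray(density_ratios, dtype=float)
        # Penalize only transitions that are over-represented in model rollouts
        # relative to the offline data (ratio > 1); the exact penalty used by
        # DROP is defined in the paper.
        penalty = np.maximum(np.log(np.clip(density_ratios, eps, None)), 0.0)
        return model_rewards - lam * penalty

    # Toy usage: in-distribution transitions (ratio near 1) keep their reward,
    # while the far out-of-distribution one is heavily discounted.
    print(pessimistic_rewards([1.0, 1.0, 1.0], [0.9, 1.0, 20.0]))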
Pages: 174-189
Number of pages: 16
References
46 entries in total
  • [1] Agarwal R., 2019, STRIVING SIMPLICITY
  • [2] Chen HK, 2019, AAAI CONF ARTIF INTE, P3312
  • [3] Chen M., Beutel A., Covington P., Jain S., Belletti F., Chi E.H., 2019, Top-K Off-Policy Correction for a REINFORCE Recommender System, Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM'19), P456-464
  • [4] Chua K, 2018, ADV NEUR IN, V31
  • [5] Covington P., Adams J., Sargin E., 2016, Deep Neural Networks for YouTube Recommendations, Proceedings of the 10th ACM Conference on Recommender Systems (RECSYS'16), P191-198
  • [6] Fu J., 2020, arXiv:2004.07219
  • [7] Fujimoto S, 2019, PR MACH LEARN RES, V97
  • [8] Gottesman O., 2018, EVALUATING REINFORCE
  • [9] Gretton A, 2012, J MACH LEARN RES, V13, P723
  • [10] Haarnoja T, 2018, PR MACH LEARN RES, V80