Supported Value Regularization for Offline Reinforcement Learning

Cited: 0
Authors
Mao, Yixiu [1 ]
Zhang, Hongchang [1 ]
Chen, Chen [1 ]
Xu, Yi [2 ]
Ji, Xiangyang [1 ]
Affiliations
[1] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
[2] Dalian Univ Technol, Sch Artificial Intelligence, Dalian, Peoples R China
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification
081104; 0812; 0835; 1405;
Abstract
Offline reinforcement learning suffers from extrapolation error and value overestimation caused by out-of-distribution (OOD) actions. To mitigate this issue, value regularization approaches penalize the learned value function so that it assigns lower values to OOD actions. However, existing value regularization methods lack a proper distinction between the regularization effects on in-distribution (ID) and OOD actions, and fail to guarantee optimal convergence results of the policy. To this end, we propose Supported Value Regularization (SVR), which penalizes the Q-values of all OOD actions while maintaining standard Bellman updates for ID ones. Specifically, we utilize the bias of importance sampling to compute the summation of Q-values over the entire OOD region, which serves as the penalty for policy evaluation. This design automatically separates the regularization for ID and OOD actions without manually distinguishing between them. In tabular MDPs, we show that the policy evaluation operator of SVR is a contraction whose fixed point outputs unbiased Q-values for ID actions and underestimated Q-values for OOD actions. Furthermore, policy iteration with SVR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Empirically, we validate the theoretical properties of SVR in a tabular maze environment and demonstrate its state-of-the-art performance on a range of continuous control tasks in the D4RL benchmark.
Pages: 23
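
The abstract describes the mechanism only at a high level. Below is a minimal tabular sketch of the "penalize OOD actions, keep standard Bellman updates for ID actions" idea, assuming a toy random MDP, a known dataset-support mask, and a simple pessimistic regression target standing in for the paper's importance-sampling-based OOD penalty; all names, hyperparameters, and the penalty form are illustrative assumptions, not the authors' implementation.

import numpy as np

# Minimal tabular sketch of the mechanism described in the abstract: standard
# Bellman backups for in-distribution (ID) actions, penalized updates for
# out-of-distribution (OOD) actions.  NOT the authors' implementation; the toy
# random MDP, the known support mask, and the pessimistic OOD target (a
# stand-in for the importance-sampling-based penalty in the paper) are all
# illustrative assumptions.

rng = np.random.default_rng(0)
num_states, num_actions = 4, 3
gamma, alpha = 0.9, 0.5          # discount factor / OOD penalty step (assumed)

R = rng.uniform(size=(num_states, num_actions))                          # toy rewards in [0, 1)
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))   # toy transition kernel
support = rng.random((num_states, num_actions)) < 0.6                    # ID (dataset-supported) actions
support[np.arange(num_states), rng.integers(num_actions, size=num_states)] = True  # >= 1 ID action per state

q_floor = R.min() / (1.0 - gamma)            # a lower bound on every true Q-value
Q = np.zeros((num_states, num_actions))

def svr_style_sweep(Q):
    """One sweep: Bellman backup on ID actions, pessimistic regression on OOD actions."""
    # support-constrained state value: maximize over ID actions only
    next_v = np.where(support, Q, -np.inf).max(axis=1)
    Q_new = Q.copy()
    for s in range(num_states):
        for a in range(num_actions):
            if support[s, a]:
                # standard Bellman update toward the support-constrained target
                Q_new[s, a] = R[s, a] + gamma * P[s, a] @ next_v
            else:
                # OOD action: no Bellman target; pull its value toward a
                # pessimistic floor so it ends up underestimated
                Q_new[s, a] = (1.0 - alpha) * Q[s, a] + alpha * q_floor
    return Q_new

for _ in range(200):
    Q = svr_style_sweep(Q)

# ID entries converge to a support-constrained Bellman fixed point;
# OOD entries settle at the pessimistic floor.
print(np.round(Q, 2))
print(support)

Under these assumptions, the ID entries of Q converge to a support-constrained Bellman fixed point while the OOD entries settle at the pessimistic floor, mirroring the unbiased-ID / underestimated-OOD fixed point the abstract claims for SVR's policy evaluation operator.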