RORL: Robust Offline Reinforcement Learning via Conservative Smoothing

Cited by: 0
Authors
Yang, Rui [1 ]
Bai, Chenjia [2 ]
Ma, Xiaoteng [3 ]
Wang, Zhaoran [4 ]
Zhang, Chongjie [3 ]
Han, Lei [5 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[2] Shanghai AI Lab, Shanghai, Peoples R China
[3] Tsinghua Univ, Beijing, Peoples R China
[4] Northwestern Univ, Evanston, IL USA
[5] Tencent Robot X, Shenzhen, Peoples R China
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022
Funding
National Natural Science Foundation of China;
Keywords
GO;
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Offline reinforcement learning (RL) provides a promising direction for exploiting massive amounts of offline data for complex decision-making tasks. Due to the distribution shift issue, current offline RL algorithms are generally designed to be conservative in value estimation and action selection. However, such conservatism can impair the robustness of learned policies when they encounter observation deviations under realistic conditions, such as sensor errors and adversarial attacks. To trade off robustness and conservatism, we propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique. In RORL, we explicitly introduce regularization on the policy and the value function for states near the dataset, as well as additional conservative value estimation on these states. Theoretically, we show that RORL enjoys a tighter suboptimality bound than recent theoretical results in linear MDPs. We demonstrate that RORL achieves state-of-the-art performance on the general offline RL benchmark and is considerably robust to adversarial observation perturbations.
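The conservative smoothing idea described in the abstract can be illustrated with a minimal sketch: sample states in a small ball around a dataset state, penalize variation of the Q-value over those perturbed states, and add a conservative term that discourages overestimation on them. All names below (`q`, `conservative_smoothing_loss`, the toy linear critic, and the `eps`/`margin` parameters) are hypothetical illustrations, not the paper's actual loss or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic Q(s, a) = w . [s; a], standing in for a neural Q-network.
w = rng.normal(size=4)

def q(s, a):
    return np.concatenate([s, a]) @ w

def conservative_smoothing_loss(s, a, eps=0.1, n_perturb=8, margin=0.05):
    """Sketch of a smoothing-plus-conservatism penalty around a dataset state.

    - smooth: keeps Q roughly flat inside an eps-ball around s (robustness).
    - conservative: hinge penalty when Q on perturbed (near-OOD) states
      exceeds Q on the dataset state by more than a margin (pessimism).
    """
    base = q(s, a)
    perturbed = s + rng.uniform(-eps, eps, size=(n_perturb, s.size))
    q_pert = np.array([q(sp, a) for sp in perturbed])
    smooth = np.mean((q_pert - base) ** 2)
    conservative = np.mean(np.maximum(q_pert - base - margin, 0.0))
    return smooth + conservative

s = rng.normal(size=2)
a = rng.normal(size=2)
loss = conservative_smoothing_loss(s, a)
print(loss)
```

In a full training loop this term would be added to a standard offline actor-critic objective; an analogous divergence penalty on the policy over the same perturbed states would smooth the actor as well.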
Pages: 16