Towards Robust Offline-to-Online Reinforcement Learning via Uncertainty and Smoothness

Times Cited: 0
Authors
Wen, Xiaoyu [1 ]
Yu, Xudong [2 ]
Yang, Rui [3 ]
Chen, Haoyuan [1 ]
Bai, Chenjia [4 ,5 ]
Wang, Zhen [1 ]
Affiliations
[1] Northwestern Polytech Univ, Xian, Shaanxi, Peoples R China
[2] Harbin Inst Technol, Harbin, Heilongjiang, Peoples R China
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[4] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
[5] Northwestern Polytech Univ, Shenzhen Res Inst, Shenzhen, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC classification
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
To obtain a near-optimal policy with fewer interactions in Reinforcement Learning (RL), a promising approach involves the combination of offline RL, which enhances sample efficiency by leveraging offline datasets, and online RL, which explores informative transitions by interacting with the environment. Offline-to-Online RL provides a paradigm for improving an offline-trained agent within limited online interactions. However, due to the significant distribution shift between online experiences and offline data, most offline RL algorithms suffer from performance drops and fail to achieve stable policy improvement in offline-to-online adaptation. To address this problem, we propose the Robust Offline-to-Online (RO2O) algorithm, designed to enhance offline policies through uncertainty and smoothness, and to mitigate the performance drop in online adaptation. Specifically, RO2O incorporates a Q-ensemble for uncertainty penalty and adversarial samples for policy and value smoothness, which enable RO2O to maintain a consistent learning procedure in online adaptation without requiring special changes to the learning objective. Theoretical analyses in linear MDPs demonstrate that the uncertainty and smoothness lead to a tighter optimality bound in offline-to-online learning under distribution shift. Experimental results illustrate the superiority of RO2O in facilitating stable offline-to-online learning and achieving significant improvement with limited online interactions. (c) 2024 The Authors.
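The abstract names two mechanisms: an uncertainty penalty from Q-ensemble disagreement and a smoothness regularizer built from adversarially perturbed states. The following is a minimal sketch of those two ingredients, not the authors' implementation; the class and parameter names (`QEnsemble`, `beta`, `epsilon`) and the gradient-sign perturbation scheme are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the RO2O source code) of:
#  (1) an uncertainty-penalized Bellman target from a Q-ensemble, and
#  (2) a smoothness loss on policy/value outputs under small state perturbations.
import torch
import torch.nn as nn


class QEnsemble(nn.Module):
    """N independent Q-networks sharing the same architecture."""
    def __init__(self, state_dim, action_dim, n_members=5, hidden=256):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            ) for _ in range(n_members)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return torch.stack([q(x) for q in self.members], dim=0)  # (N, B, 1)


def pessimistic_target(q_ensemble, reward, next_state, next_action,
                       gamma=0.99, beta=1.0):
    """Bellman target penalized by ensemble disagreement: mean - beta * std."""
    with torch.no_grad():
        qs = q_ensemble(next_state, next_action)       # (N, B, 1)
        q_mean, q_std = qs.mean(dim=0), qs.std(dim=0)  # (B, 1) each
        return reward + gamma * (q_mean - beta * q_std)


def smoothness_loss(policy, q_ensemble, state, epsilon=1e-3):
    """Penalize changes of the policy and Q-values under a small perturbation
    of the state. Here the perturbation is one gradient-sign step on the mean
    Q-value (an assumed adversarial scheme; any perturbation generator fits)."""
    state_adv = state.clone().requires_grad_(True)
    q_val = q_ensemble(state_adv, policy(state_adv)).mean()
    grad, = torch.autograd.grad(q_val, state_adv)
    state_adv = (state + epsilon * grad.sign()).detach()

    pi_gap = (policy(state) - policy(state_adv)).pow(2).mean()
    q_gap = (q_ensemble(state, policy(state)) -
             q_ensemble(state_adv, policy(state_adv))).pow(2).mean()
    return pi_gap + q_gap
```

Under this reading, both terms are simply added to the usual critic and actor losses, which is consistent with the abstract's claim that the same learning objective carries over unchanged from offline training to online adaptation.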
Pages: 481-509
Number of pages: 29