Dual Behavior Regularized Offline Deterministic Actor-Critic

Times Cited: 3
Authors
Cao, Shuo [1 ,2 ]
Wang, Xuesong [1 ,2 ]
Cheng, Yuhu [1 ,2 ]
Affiliations
[1] China Univ Min & Technol, Engn Res Ctr Intelligent Control Underground Space, Minist Educ, Xuzhou 221116, Peoples R China
[2] China Univ Min & Technol, Sch Informat & Control Engn, Xuzhou 221116, Peoples R China
Source
IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS | 2024, Vol. 54, No. 8
Funding
National Natural Science Foundation of China;
Keywords
Anti-exploration behavior value; dual behavior regularization (DBR); mild-local behavior cloning (BC); offline deterministic Actor-Critic; reinforcement learning (RL);
DOI
10.1109/TSMC.2024.3388007
CLC Classification Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
To mitigate the extrapolation error arising in the offline reinforcement learning (RL) paradigm, existing methods typically make the learned Q-function overly conservative or impose global policy constraints. In this article, we propose a dual behavior regularized offline deterministic Actor-Critic (DBRAC) that simultaneously applies behavior regularization to the coupled, iteratively alternating policy evaluation (PE) and policy improvement (PI) steps of policy iteration. In the PE phase, the difference between the Q-function and the behavior value is taken as an anti-exploration behavior-value regularization term that drives the Q-function toward its true Q-value, significantly reducing the conservatism of the learned Q-function. In the PI phase, the estimated action variances of the behavior policy in different states are used to set the weight and threshold of a mild-local behavior cloning (BC) regularization term, which regulates the local improvement potential of the learned policy. Experiments on the well-known datasets for deep data-driven RL (D4RL) demonstrate that DBRAC quickly learns more competitive task-solving policies in offline settings with varying data quality, significantly outperforming state-of-the-art offline RL baselines.
Pages: 4841-4852
Number of Pages: 12
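
The abstract describes two regularizers: an anti-exploration behavior-value term added during policy evaluation, and a variance-weighted, thresholded behavior cloning term added during policy improvement. The following is a minimal sketch of how such a pair of losses could look on top of a standard deterministic actor-critic in PyTorch; it is based only on the abstract, so the network names, coefficients (alpha, lam), and the exact mapping from behavior-action variance to the BC weight and threshold are illustrative assumptions, not the authors' published DBRAC implementation.

```python
# Sketch of the two DBRAC-style regularizers described in the abstract.
# Assumptions (not from the paper): all network architectures, the
# coefficients alpha/lam, and the variance-to-weight/threshold mapping.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))
    def forward(self, x):
        return self.net(x)

def critic_loss(critic, target_critic, behavior_value, actor,
                s, a, r, s_next, done, gamma=0.99, alpha=0.1):
    """Policy evaluation with an anti-exploration behavior-value regularizer:
    the gap between Q at the policy's action and a pre-fitted behavior value
    is penalized, pulling Q back toward in-distribution estimates instead of
    making it globally pessimistic (assumed form of the term)."""
    with torch.no_grad():
        a_next = actor(s_next)
        target_q = r + gamma * (1.0 - done) * target_critic(
            torch.cat([s_next, a_next], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    td_loss = F.mse_loss(q, target_q)
    q_pi = critic(torch.cat([s, actor(s).detach()], dim=-1))
    anti_exploration = (q_pi - behavior_value(s)).mean()
    return td_loss + alpha * anti_exploration

def actor_loss(actor, critic, s, a_data, behavior_std, lam=1.0):
    """Policy improvement with mild-local behavior cloning: the BC penalty is
    scaled by the behavior policy's estimated per-state action std and only
    activates once the policy leaves a std-sized neighborhood of the data
    action (assumed weight/threshold choice)."""
    a_pi = actor(s)
    q_term = -critic(torch.cat([s, a_pi], dim=-1)).mean()
    dist = (a_pi - a_data).pow(2).sum(-1, keepdim=True).sqrt()
    threshold = behavior_std               # looser constraint where data is diverse
    weight = lam / (behavior_std + 1e-6)   # tighter constraint where data is narrow
    mild_bc = (weight * F.relu(dist - threshold)).mean()
    return q_term + mild_bc

if __name__ == "__main__":
    S, A, B = 17, 6, 256
    actor = nn.Sequential(MLP(S, A), nn.Tanh())
    critic, target_critic = MLP(S + A, 1), MLP(S + A, 1)
    behavior_value = MLP(S, 1)
    s, a, s_next = torch.randn(B, S), torch.randn(B, A), torch.randn(B, S)
    r, done = torch.randn(B, 1), torch.zeros(B, 1)
    behavior_std = torch.rand(B, 1)  # would come from a fitted behavior model
    print(critic_loss(critic, target_critic, behavior_value, actor,
                      s, a, r, s_next, done))
    print(actor_loss(actor, critic, s, a, behavior_std))
```

The intent of the thresholded BC term, as far as the abstract indicates, is to leave the policy free to improve inside a state-dependent neighborhood of the data and constrain it only beyond that neighborhood, in contrast to a global constraint applied uniformly in every state.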