Dual Behavior Regularized Offline Deterministic Actor-Critic

Times Cited: 3
Authors
Cao, Shuo [1 ,2 ]
Wang, Xuesong [1 ,2 ]
Cheng, Yuhu [1 ,2 ]
Affiliations
[1] China Univ Min & Technol, Engn Res Ctr Intelligent Control Underground Space, Minist Educ, Xuzhou 221116, Peoples R China
[2] China Univ Min & Technol, Sch Informat & Control Engn, Xuzhou 221116, Peoples R China
Source
IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS | 2024, Vol. 54, No. 8
Funding
National Natural Science Foundation of China;
Keywords
Anti-exploration behavior value; dual behavior regularization (DBR); mild-local behavior cloning (BC); offline deterministic Actor-Critic; reinforcement learning (RL);
DOI
10.1109/TSMC.2024.3388007
CLC Classification Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
To mitigate the extrapolation error arising in the offline reinforcement learning (RL) paradigm, existing methods typically make the learned Q-function overly conservative or impose global policy constraints. In this article, we propose a dual behavior regularized offline deterministic Actor-Critic (DBRAC) that simultaneously applies behavior regularization to the coupled, iteratively alternating policy evaluation (PE) and policy improvement (PI) steps of policy iteration. In the PE phase, the difference between the Q-function and the behavior value is taken as an anti-exploration behavior-value regularization term that drives the Q-function toward its true Q-value, significantly reducing the conservatism of the learned Q-function. In the PI phase, the estimated action variances of the behavior policy in different states are used to set the weight and threshold of a mild-local behavior cloning (BC) regularization term, which regulates the local improvement potential of the learned policy. Experiments on the well-known datasets for deep data-driven RL (D4RL) demonstrate that DBRAC quickly learns more competitive task-solving policies in offline settings with varying data quality, significantly outperforming state-of-the-art offline RL baselines.
Pages: 4841-4852
Number of Pages: 12
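
The abstract describes two regularizers: an anti-exploration behavior-value term added during policy evaluation, and a variance-weighted, thresholded behavior cloning term added during policy improvement. The following is a minimal sketch of how such a pair of losses could look on top of a standard deterministic actor-critic in PyTorch; it is based only on the abstract, so the network names, coefficients (alpha, lam), and the exact mapping from behavior-action variance to the BC weight and threshold are illustrative assumptions, not the authors' published DBRAC implementation.

```python
# Sketch of the two DBRAC-style regularizers described in the abstract.
# Assumptions (not from the paper): all network architectures, the
# coefficients alpha/lam, and the variance-to-weight/threshold mapping.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))
    def forward(self, x):
        return self.net(x)

def critic_loss(critic, target_critic, behavior_value, actor,
                s, a, r, s_next, done, gamma=0.99, alpha=0.1):
    """Policy evaluation with an anti-exploration behavior-value regularizer:
    the gap between Q at the policy's action and a pre-fitted behavior value
    is penalized, pulling Q back toward in-distribution estimates instead of
    making it globally pessimistic (assumed form of the term)."""
    with torch.no_grad():
        a_next = actor(s_next)
        target_q = r + gamma * (1.0 - done) * target_critic(
            torch.cat([s_next, a_next], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    td_loss = F.mse_loss(q, target_q)
    q_pi = critic(torch.cat([s, actor(s).detach()], dim=-1))
    anti_exploration = (q_pi - behavior_value(s)).mean()
    return td_loss + alpha * anti_exploration

def actor_loss(actor, critic, s, a_data, behavior_std, lam=1.0):
    """Policy improvement with mild-local behavior cloning: the BC penalty is
    scaled by the behavior policy's estimated per-state action std and only
    activates once the policy leaves a std-sized neighborhood of the data
    action (assumed weight/threshold choice)."""
    a_pi = actor(s)
    q_term = -critic(torch.cat([s, a_pi], dim=-1)).mean()
    dist = (a_pi - a_data).pow(2).sum(-1, keepdim=True).sqrt()
    threshold = behavior_std               # looser constraint where data is diverse
    weight = lam / (behavior_std + 1e-6)   # tighter constraint where data is narrow
    mild_bc = (weight * F.relu(dist - threshold)).mean()
    return q_term + mild_bc

if __name__ == "__main__":
    S, A, B = 17, 6, 256
    actor = nn.Sequential(MLP(S, A), nn.Tanh())
    critic, target_critic = MLP(S + A, 1), MLP(S + A, 1)
    behavior_value = MLP(S, 1)
    s, a, s_next = torch.randn(B, S), torch.randn(B, A), torch.randn(B, S)
    r, done = torch.randn(B, 1), torch.zeros(B, 1)
    behavior_std = torch.rand(B, 1)  # would come from a fitted behavior model
    print(critic_loss(critic, target_critic, behavior_value, actor,
                      s, a, r, s_next, done))
    print(actor_loss(actor, critic, s, a, behavior_std))
```

The intent of the thresholded BC term, as far as the abstract indicates, is to leave the policy free to improve inside a state-dependent neighborhood of the data and constrain it only beyond that neighborhood, in contrast to a global constraint applied uniformly in every state.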