Collaborative promotion: Achieving safety and task performance by integrating imitation reinforcement learning

Cited by: 0
Authors
Zhang, Cai [1 ]
Zhang, Xiaoxiong [2 ,3 ]
Zhang, Hui [2 ,3 ]
Zhu, Fei [1 ]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215006, Peoples R China
[2] Natl Univ Def Technol, Sixty-Third Res Inst, Nanjing 210007, Peoples R China
[3] Natl Univ Def Technol, Lab Big Data & Decis, Changsha 410073, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Safe reinforcement learning; Imitation learning; Dual policy networks; Multi-objective optimization; Loose coupling;
DOI
10.1016/j.eswa.2024.124820
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Although the importance of safety for artificial intelligence is self-evident, focusing excessively on safety at the expense of task performance may make an agent overly conservative and hesitant. How to strike a balance between safety and task performance has therefore become a pressing concern. To address this issue, we introduce Collaborative Promotion (CP), a framework designed to harmonize safety and task objectives and thereby enable loosely coupled optimization of the two. CP is a novel dual-policy framework in which the safety objective and the task objective are assigned, as primary goals, to a safety policy and a task policy, respectively. Each policy is trained in an actor-critic framework, with the value function guiding improvement on its primary objective. With the aid of imitation learning, each policy pursues its secondary objective through behavioral cloning, treating the other policy as the expert in that domain. The safety policy combines the two objectives with a weighted-sum method for multi-objective optimization, establishing a primary-secondary relationship that facilitates loosely coupled optimization of safety and task objectives. On the Safe Navigation and Safe Velocity benchmarks, we compare CP against task-specific and safety-specific algorithms, and extensive experiments demonstrate that CP achieves the intended goals.
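To make the weighted-sum idea concrete, the following is a minimal PyTorch-style sketch of how the safety policy's update could combine its primary actor-critic objective with a secondary behavioral-cloning term toward the task policy. The function names (safety_policy, task_policy, safety_critic), the MSE cloning loss, and the weight w are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def safety_policy_loss(safety_policy, task_policy, safety_critic, states, w=0.5):
    # Primary objective: standard actor-critic actor loss, i.e. maximize
    # the safety critic's value of the safety policy's own actions.
    actions = safety_policy(states)
    primary = -safety_critic(states, actions).mean()

    # Secondary objective: behavioral cloning toward the task policy,
    # which is treated as the "expert" for task performance.
    with torch.no_grad():
        expert_actions = task_policy(states)
    secondary = F.mse_loss(actions, expert_actions)

    # The weighted sum establishes the primary-secondary relationship,
    # giving a loosely coupled optimization of both objectives
    # (a weight w < 1 keeps safety as the primary goal).
    return primary + w * secondary

A symmetric loss (with the roles of the two policies swapped) would train the task policy, so that each policy imitates the other only as a secondary objective.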
Pages: 12