A prioritized objective actor-critic method for deep reinforcement learning

Cited by: 11
Authors
Nguyen, Ngoc Duy [1 ]
Nguyen, Thanh Thi [2 ]
Vamplew, Peter [3 ]
Dazeley, Richard [4 ]
Nahavandi, Saeid [1 ]
Affiliations
[1] Deakin Univ, Inst Intelligent Syst Res & Innovat, Waurn Ponds Campus, Geelong, Vic, Australia
[2] Deakin Univ, Sch Informat Technol, Burwood Campus, Melbourne, Vic, Australia
[3] Federat Univ Australia, Federat Learning Agents Grp, Sch Sci Engn & Informat Technol, Ballarat, Vic, Australia
[4] Deakin Univ, Sch Informat Technol, Waurn Ponds Campus, Geelong, Vic, Australia
Keywords
Deep learning; Reinforcement learning; Learning systems; Multi-objective optimization; Actor-critic architecture
DOI
10.1007/s00521-021-05795-0
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
An increasing number of complex problems pose significant challenges to decision-making theory and reinforcement learning practice. These problems often involve multiple conflicting reward signals that inherently hinder the agent's exploration toward a specific goal. In extreme cases, the agent becomes stuck in a sub-optimal solution and starts behaving harmfully. To overcome such obstacles, we introduce two actor-critic deep reinforcement learning methods, Multi-Critic Single Policy (MCSP) and Single Critic Multi-Policy (SCMP), which adjust agent behavior to efficiently achieve a designated goal by adopting a weighted-sum scalarization of different objective functions. In particular, MCSP creates a human-centric policy that corresponds to a predefined priority weighting of the objectives, whereas SCMP generates a mixed policy from a set of priority weights, i.e., the generated policy uses the knowledge of different policies (each corresponding to one priority weighting) to dynamically prioritize objectives in real time. We implement our methods on top of the Asynchronous Advantage Actor-Critic (A3C) algorithm, exploiting its multithreading mechanism to dynamically balance the training intensity of different policies within a single network. Finally, simulation results show that MCSP and SCMP significantly outperform A3C with respect to the mean of total rewards in two complex problems: Food Collector and Seaquest.
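The following is a minimal sketch (not the authors' released code) of the MCSP idea described above: a single policy head, one critic head per objective, and a weighted-sum scalarization of the per-objective advantages in the actor-critic loss. The network sizes, names, and tensor layout are illustrative assumptions.

```python
# Sketch of a Multi-Critic Single Policy (MCSP) update with weighted-sum
# scalarization of per-objective advantages. Hyperparameters and shapes are
# assumptions for illustration, not the paper's exact configuration.
import torch
import torch.nn as nn

class MCSPNet(nn.Module):
    def __init__(self, obs_dim, n_actions, n_objectives, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy = nn.Linear(hidden, n_actions)     # single policy head
        self.values = nn.Linear(hidden, n_objectives)  # one critic output per objective

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy(h), self.values(h)

def mcsp_loss(net, obs, actions, returns, weights,
              value_coef=0.5, entropy_coef=0.01):
    """obs: [B, obs_dim]; actions: [B]; returns: [B, n_objectives] per-objective
    discounted returns; weights: [n_objectives] priority weights (summing to 1)."""
    logits, values = net(obs)
    dist = torch.distributions.Categorical(logits=logits)
    # Per-objective advantages, scalarized by the predefined priority weights.
    advantages = (returns - values).detach()           # [B, n_objectives]
    scalar_adv = (advantages * weights).sum(dim=1)     # weighted-sum scalarization
    policy_loss = -(dist.log_prob(actions) * scalar_adv).mean()
    value_loss = ((returns - values) ** 2).mean()      # each critic regresses its objective
    entropy = dist.entropy().mean()                    # exploration bonus, as in A3C
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```

SCMP, by contrast, would keep a single critic but maintain several policy heads (one per priority weighting) and mix them at decision time; the scalarization step above is the common ingredient of both variants.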
Pages: 10335-10349
Number of pages: 15