A prioritized objective actor-critic method for deep reinforcement learning

Cited by: 11
Authors
Nguyen, Ngoc Duy [1 ]
Nguyen, Thanh Thi [2 ]
Vamplew, Peter [3 ]
Dazeley, Richard [4 ]
Nahavandi, Saeid [1 ]
Affiliations
[1] Deakin Univ, Inst Intelligent Syst Res & Innovat, Waurn Ponds Campus, Geelong, Vic, Australia
[2] Deakin Univ, Sch Informat Technol, Burwood Campus, Melbourne, Vic, Australia
[3] Federat Univ Australia, Federat Learning Agents Grp, Sch Sci Engn & Informat Technol, Ballarat, Vic, Australia
[4] Deakin Univ, Sch Informat Technol, Waurn Ponds Campus, Geelong, Vic, Australia
Keywords
Deep learning; Reinforcement learning; Learning systems; Multi-objective optimization; Actor-critic architecture
DOI
10.1007/s00521-021-05795-0
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
An increasing number of complex problems pose significant challenges to decision-making theory and reinforcement learning practice. These problems often involve multiple conflicting reward signals that inherently hinder the agent's exploration toward a specific goal. In extreme cases, the agent becomes stuck in a sub-optimal solution and starts behaving harmfully. To overcome such obstacles, we introduce two actor-critic deep reinforcement learning methods, Multi-Critic Single Policy (MCSP) and Single Critic Multi-Policy (SCMP), which adjust agent behavior to efficiently achieve a designated goal by adopting a weighted-sum scalarization of different objective functions. In particular, MCSP creates a human-centric policy that corresponds to a predefined priority weighting of the objectives, whereas SCMP generates a mixed policy from a set of priority weights, i.e., the generated policy uses the knowledge of different policies (each corresponding to one priority weighting) to dynamically prioritize objectives in real time. We implement our methods on top of the Asynchronous Advantage Actor-Critic (A3C) algorithm, exploiting its multithreading mechanism to dynamically balance the training intensity of different policies within a single network. Finally, simulation results show that MCSP and SCMP significantly outperform A3C with respect to the mean of total rewards in two complex problems: Food Collector and Seaquest.
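The following is a minimal sketch (not the authors' released code) of the MCSP idea described above: a single policy head, one critic head per objective, and a weighted-sum scalarization of the per-objective advantages in the actor-critic loss. The network sizes, names, and tensor layout are illustrative assumptions.

```python
# Sketch of a Multi-Critic Single Policy (MCSP) update with weighted-sum
# scalarization of per-objective advantages. Hyperparameters and shapes are
# assumptions for illustration, not the paper's exact configuration.
import torch
import torch.nn as nn

class MCSPNet(nn.Module):
    def __init__(self, obs_dim, n_actions, n_objectives, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy = nn.Linear(hidden, n_actions)     # single policy head
        self.values = nn.Linear(hidden, n_objectives)  # one critic output per objective

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy(h), self.values(h)

def mcsp_loss(net, obs, actions, returns, weights,
              value_coef=0.5, entropy_coef=0.01):
    """obs: [B, obs_dim]; actions: [B]; returns: [B, n_objectives] per-objective
    discounted returns; weights: [n_objectives] priority weights (summing to 1)."""
    logits, values = net(obs)
    dist = torch.distributions.Categorical(logits=logits)
    # Per-objective advantages, scalarized by the predefined priority weights.
    advantages = (returns - values).detach()           # [B, n_objectives]
    scalar_adv = (advantages * weights).sum(dim=1)     # weighted-sum scalarization
    policy_loss = -(dist.log_prob(actions) * scalar_adv).mean()
    value_loss = ((returns - values) ** 2).mean()      # each critic regresses its objective
    entropy = dist.entropy().mean()                    # exploration bonus, as in A3C
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```

SCMP, by contrast, would keep a single critic but maintain several policy heads (one per priority weighting) and mix them at decision time; the scalarization step above is the common ingredient of both variants.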
Pages: 10335-10349
Number of pages: 15