Multi-Alpha Soft Actor-Critic: Overcoming Stochastic Biases in Maximum Entropy Reinforcement Learning

Cited by: 1
Authors
Igoe, Conor [1 ]
Pande, Swapnil [2 ]
Venkatraman, Siddarth [2 ]
Schneider, Jeff [2 ]
Affiliations
[1] Carnegie Mellon Univ, Sch Comp Sci, Machine Learning Dept, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Inst Robot, Sch Comp Sci, Pittsburgh, PA 15213 USA
Source
2023 IEEE International Conference on Robotics and Automation (ICRA 2023) | 2023
DOI
10.1109/ICRA48891.2023.10161395
CLC number
TP [Automation technology, Computer technology]
Discipline code
0812
Abstract
The successful application of robotic control requires intelligent decision-making to handle the long tail of complex scenarios that arise in real-world environments. Recently, Deep Reinforcement Learning (DRL) has provided a data-driven framework to automatically learn effective policies in such complex settings. Since its introduction in 2018, Soft Actor-Critic (SAC) remains one of the most popular off-policy DRL algorithms and has been used extensively to learn performant robotic control policies. However, in this paper we argue that by relying on the maximum entropy formalism to define learning objectives, previous work introduces a significant bias away from optimal decision-making, which often requires near-deterministic behaviour for high-precision tasks. Moreover, we show that when training with the original variants of SAC, overcoming this bias by reducing entropy budgets or entropy coefficients introduces separate issues that lead to slow or unstable learning. We address these shortcomings by treating the entropy coefficient α as a random variable and introduce Multi-Alpha Soft Actor-Critic (MAS). We show how MAS overcomes the stochastic bias of SAC in a variety of robotic control tasks including the CARLA urban-driving simulator, while maintaining the stability and sample efficiency of the original algorithms.
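The abstract's core idea is treating the entropy coefficient α as a random variable rather than a single fixed scalar. A minimal sketch of one way this could look, assuming (hypothetically) a discrete set of α values each with its own soft Bellman target; the specific values, function names, and update rule here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

# Hypothetical illustration: maintain a set of entropy coefficients,
# including alpha = 0, and compute one soft Bellman target per alpha.
# With alpha = 0 the target reduces to the standard (entropy-free)
# backup, which is what near-deterministic, high-precision control needs.
alphas = np.array([0.2, 0.05, 0.0])

def soft_q_targets(reward, next_q, next_logpi, gamma=0.99):
    """Per-alpha soft Bellman backup: r + gamma * (Q' - alpha * log pi')."""
    return reward + gamma * (next_q - alphas * next_logpi)

# Toy transition: reward 1.0, next-state Q estimates, next-action log-prob.
targets = soft_q_targets(reward=1.0,
                         next_q=np.array([2.0, 2.0, 2.0]),
                         next_logpi=-1.5)
# The alpha = 0 entry recovers the plain Bellman target r + gamma * Q'.
```

Higher α entries keep the entropy bonus (and its exploration benefit) during training, while the α = 0 target is free of the stochastic bias the paper describes.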
Pages: 7162-7168 (7 pages)