Multi-Alpha Soft Actor-Critic: Overcoming Stochastic Biases in Maximum Entropy Reinforcement Learning

Cited by: 1
Authors
Igoe, Conor [1 ]
Pande, Swapnil [2 ]
Venkatraman, Siddarth [2 ]
Schneider, Jeff [2 ]
Affiliations
[1] Carnegie Mellon Univ, Sch Comp Sci, Machine Learning Dept, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Inst Robot, Sch Comp Sci, Pittsburgh, PA 15213 USA
Source
2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2023) | 2023
Keywords
DOI
10.1109/ICRA48891.2023.10161395
CLC Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
The successful application of robotic control requires intelligent decision-making to handle the long tail of complex scenarios that arise in real-world environments. Recently, Deep Reinforcement Learning (DRL) has provided a data-driven framework to automatically learn effective policies in such complex settings. Since its introduction in 2018, Soft Actor-Critic (SAC) has remained one of the most popular off-policy DRL algorithms and has been used extensively to learn performant robotic control policies. However, in this paper we argue that by relying on the maximum entropy formalism to define learning objectives, previous work introduces a significant bias away from optimal decision-making, which often requires near-deterministic behaviour for high-precision tasks. Moreover, we show that when training with the original variants of SAC, overcoming this bias by reducing entropy budgets or entropy coefficients introduces separate issues that lead to slow or unstable learning. We address these shortcomings by treating the entropy coefficient α as a random variable and introduce Multi-Alpha Soft Actor-Critic (MAS). We show how MAS overcomes the stochastic bias of SAC in a variety of robotic control tasks, including the CARLA urban-driving simulator, while maintaining the stability and sample efficiency of the original algorithms.
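
To make the abstract's core idea concrete, the following is a minimal PyTorch sketch of what conditioning a SAC-style policy on a sampled entropy coefficient might look like. All names (AlphaConditionedPolicy, actor_loss), the discrete alpha support, and the per-transition sampling scheme are illustrative assumptions; the paper's actual architecture, alpha distribution, and update rules are not reproduced here.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed discrete support for the entropy coefficient; the paper treats
# alpha as a random variable, but its exact distribution is not shown here.
ALPHAS = torch.tensor([0.0, 0.01, 0.1, 1.0])

class AlphaConditionedPolicy(nn.Module):
    """Gaussian policy that takes alpha as an extra input feature, so one
    network spans near-deterministic (small alpha) to highly stochastic
    (large alpha) behaviour."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs, alpha):
        h = self.body(torch.cat([obs, alpha.unsqueeze(-1)], dim=-1))
        dist = torch.distributions.Normal(
            self.mu(h), self.log_std(h).clamp(-5, 2).exp())
        a = dist.rsample()  # reparameterised sample for pathwise gradients
        # log-likelihood with the standard tanh-squashing correction
        logp = dist.log_prob(a).sum(-1)
        logp = logp - (2 * (math.log(2) - a - F.softplus(-2 * a))).sum(-1)
        return torch.tanh(a), logp

def actor_loss(policy, q_fn, obs):
    """SAC-style actor loss with alpha resampled per transition.
    q_fn is a critic that must also be conditioned on alpha."""
    alpha = ALPHAS[torch.randint(len(ALPHAS), (obs.shape[0],))]
    act, logp = policy(obs, alpha)
    # maximise Q minus the alpha-weighted entropy penalty, as in SAC
    return (alpha * logp - q_fn(obs, act, alpha)).mean()

# Example usage with a placeholder critic:
policy = AlphaConditionedPolicy(obs_dim=8, act_dim=2)
dummy_q = lambda o, a, al: torch.zeros(o.shape[0])
actor_loss(policy, dummy_q, torch.randn(32, 8)).backward()

At evaluation time one would presumably condition on the smallest alpha to recover the near-deterministic behaviour the abstract argues is needed for high-precision tasks, while larger alphas preserve the exploration and stability benefits of maximum entropy training.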
Pages: 7162 - 7168
Page count: 7