Modular deep reinforcement learning from reward and punishment for robot navigation

Cited by: 43
Authors
Wang, Jiexin [1]
Elfwing, Stefan [2]
Uchibe, Eiji [1]
Affiliations
[1] ATR Computational Neuroscience Laboratories, Department of Brain Robot Interface, 2-2-2 Hikaridai, Kyoto 619-0288, Japan
[2] ContextVision AB, Storgatan 24, S-58223 Linköping, Sweden
Funding
Japan Society for the Promotion of Science;
Keywords
Modular reinforcement learning; Deep reinforcement learning; Max pain; Robot navigation; Maze solving; TEMPORAL DIFFERENCE MODELS;
DOI
10.1016/j.neunet.2020.12.001
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Modular reinforcement learning decomposes a monolithic task into several sub-tasks with sub-goals and learns each in parallel to solve the original problem. Such learning patterns can be traced in the brains of animals. Recent evidence in neuroscience shows that animals process rewards and punishments in separate systems, suggesting a different perspective for modularizing reinforcement learning tasks. MaxPain and its deep variant, Deep MaxPain, demonstrated the advantages of such a dichotomy-based decomposition over conventional Q-learning in terms of safety and learning efficiency. The two methods differ in how the policy is derived: MaxPain linearly combines the reward and punishment value functions and generates a joint policy from the unified values, whereas Deep MaxPain addresses the scaling problem in high-dimensional cases by linearly mixing the two sub-policies obtained from those value functions. However, the mixing weights in both methods were set manually, leading to inadequate use of the learned modules. In this work, we discuss how the scaling of reward and punishment signals relates to the discount factor γ, and propose a weak constraint for signal design. To further exploit the learned models, we propose a state-value-dependent weighting scheme that automatically tunes the mixing weights, in hard-max and softmax forms, based on a case analysis of the Boltzmann distribution. We focus on maze-solving navigation tasks and investigate how the two objectives (pain avoidance and goal reaching) influence each other during learning. We also propose a sensor-fusion network that combines lidar readings with images captured by a monocular camera, instead of lidar-only or image-only sensing. Results in simulations of three maze types of different complexity, and in a real-robot L-maze experiment on a Turtlebot3 Waffle Pi, demonstrate the improvements achieved by our methods. (c) 2020 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
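To make the mixing idea described in the abstract concrete, the following is a minimal, hypothetical Python sketch of combining a reward module and a punishment ("pain") module into a joint policy. The function names, the Boltzmann sub-policies, the sign convention on the pain values, and the fixed weight w are illustrative assumptions rather than the paper's exact formulation; the paper's contribution is to replace such a hand-tuned weight with a state-value-dependent hard-max or softmax weight.

import numpy as np

def boltzmann(q_values, temperature=1.0):
    # Softmax (Boltzmann) distribution over actions for one state.
    z = q_values / temperature
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def joint_policy(q_reward, q_pain, w):
    # Linearly mix a goal-reaching sub-policy with a pain-avoiding one.
    # q_reward, q_pain: action-value estimates for the current state.
    # w: mixing weight in [0, 1]; fixed here, but tuned automatically
    #    (state-value dependent) in the proposed scheme.
    pi_reward = boltzmann(q_reward)   # prefers high reward values
    pi_pain = boltzmann(-q_pain)      # prefers low punishment values
    return w * pi_reward + (1.0 - w) * pi_pain

# Hypothetical example: one state with three actions.
q_r = np.array([1.0, 0.2, -0.5])   # reward-module estimates
q_p = np.array([0.1, 0.8, 0.0])    # punishment-module estimates
print(joint_policy(q_r, q_p, w=0.7))

The convex combination of the two sub-policy distributions remains a valid probability distribution over actions, which is the property the mixing weight relies on.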
Pages: 115-126
Page count: 12