Robot Skill Learning: From Reinforcement Learning to Evolution Strategies

Cited by: 2
Authors
Affiliations
[1] Robotics and Computer Vision, ENSTA-ParisTech, Paris
[2] Flowers Team, Inria Bordeaux Sud-Ouest, Talence
[3] Institut des Systèmes Intelligents et de Robotique, Université Pierre et Marie Curie, CNRS UMR 7222, Paris
Source
De Gruyter Open Ltd, Vol. 4, 2013
Keywords
black-box optimization; dynamic movement primitives; evolution strategies; reinforcement learning;
DOI
10.2478/pjbr-2013-0003
Abstract
Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. Owing to current trends involving searching in parameter space (rather than action space) and using reward-weighted averaging (rather than gradient estimation), reinforcement learning algorithms for policy improvement, e.g. PoWER and PI2, are now able to learn sophisticated high-dimensional robot skills. A side-effect of these trends has been that, over the last 15 years, reinforcement learning (RL) algorithms have become more and more similar to evolution strategies such as (μW, λ)-ES and CMA-ES. Evolution strategies treat policy improvement as a black-box optimization problem, and thus do not leverage the problem structure, whereas RL algorithms do. In this paper, we demonstrate how two straightforward simplifications to the state-of-the-art RL algorithm PI2 suffice to convert it into the black-box optimization algorithm (μW, λ)-ES. Furthermore, we show that (μW, λ)-ES empirically outperforms PI2 on the tasks in [36]. It is striking that PI2 and (μW, λ)-ES share a common core, and that the simpler algorithm converges faster and leads to similar or lower final costs. We argue that this difference is due to a third trend in robot skill learning: the predominant use of dynamic movement primitives (DMPs). We show how DMPs dramatically simplify the learning problem, and discuss the implications of this for past and future work on policy improvement for robot skill learning.
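A minimal sketch of the shared core described in the abstract (perturb the policy parameters, evaluate each perturbation, and update the parameters by reward-weighted averaging) is given below, assuming a NumPy setting. The cost function and the constants lam, sigma, and h are illustrative placeholders, not the paper's reference implementation; refinements such as covariance matrix adaptation in CMA-ES or per-time-step weighting in PI2 are deliberately omitted.

```python
import numpy as np

def rollout_cost(theta):
    """Placeholder cost: a real setup would execute a DMP parameterized by
    theta on the robot or in simulation and return the trajectory cost."""
    return float(np.sum(theta ** 2))

def reward_weighted_averaging(theta_init, n_updates=50, lam=20, sigma=0.1, h=10.0):
    theta = np.array(theta_init, dtype=float)
    rng = np.random.default_rng(0)
    for _ in range(n_updates):
        # 1. Explore in parameter space: sample lam perturbations of the current policy.
        samples = theta + sigma * rng.standard_normal((lam, theta.size))
        costs = np.array([rollout_cost(s) for s in samples])
        # 2. Map costs to weights: exponentiate normalized costs so that
        #    low-cost samples dominate (no gradient estimation involved).
        c_min, c_max = costs.min(), costs.max()
        weights = np.exp(-h * (costs - c_min) / (c_max - c_min + 1e-12))
        weights /= weights.sum()
        # 3. Reward-weighted averaging: the new parameter vector is the
        #    weighted mean of the sampled parameter vectors.
        theta = weights @ samples
    return theta

if __name__ == "__main__":
    print(reward_weighted_averaging(np.ones(10)))
```

Because the update uses only cost-based weights and no gradient estimate, the same loop can be read either as a simplified PI2-style update (exponentiated costs) or, if the weights were replaced by rank-based weights over the best μ samples, as the mean update of a (μW, λ)-ES.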
Pages: 49-61
Number of pages: 12
References
41 items in total
  • [1] Arnold L., Auger A., Hansen N., Ollivier Y., Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles, (2011)
  • [2] Barto A., Mahadevan S., Recent advances in hierarchical reinforcement learning, Discrete Event Dynamic Systems, 13, 1-2, pp. 41-77, (2003)
  • [3] Beyer H.-G., Schwefel H.-P., Evolution strategies - A comprehensive introduction, Natural Computing, 1, 1, pp. 3-52, (2002)
  • [4] Busoniu L., Ernst D., De Schutter B., Babuska R., Cross-entropy optimization of control policies with adaptive basis functions, IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, 41, 1, pp. 196-209, (2011)
  • [5] Gomez F., Schmidhuber J., Miikkulainen R., Accelerated neural evolution through cooperatively coevolved synapses, Journal of Machine Learning Research, 9, pp. 937-965, (2008)
  • [6] Hansen N., Ostermeier A., Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation, 9, 2, pp. 159-195, (2001)
  • [7] Hansen N., The CMA Evolution Strategy: A Tutorial, (2011)
  • [8] Heidrich-Meisner V., Igel C., Evolution strategies for direct policy search, Proceedings of the 10th International Conference on Parallel Problem Solving from Nature: PPSN X, pp. 428-437, (2008)
  • [9] Heidrich-Meisner V., Igel C., Similarities and differences between policy gradient methods and evolution strategies, ESANN 2008, 16th European Symposium on Artificial Neural Networks, Bruges, Belgium, April 23-25, 2008, Proceedings, pp. 149-154, (2008)
  • [10] Ijspeert A., Nakanishi J., Pastor P., Hoffmann H., Schaal S., Dynamical Movement Primitives: Learning attractor models for motor behaviors, Neural Computation, 25, 2, pp. 328-373, (2013)