Robot Skill Learning: From Reinforcement Learning to Evolution Strategies

Cited by: 2
Authors
Affiliations
[1] Robotics and Computer Vision, ENSTA-ParisTech, Paris
[2] Flowers Team, Inria Bordeaux Sud-Ouest, Talence
[3] Institut des Systèmes Intelligents et de Robotique, Université Pierre et Marie Curie, CNRS UMR 7222, Paris
Source
De Gruyter Open Ltd, Vol. 4, 2013
Keywords
black-box optimization; dynamic movement primitives; evolution strategies; reinforcement learning;
DOI
10.2478/pjbr-2013-0003
Abstract
Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. Owing to current trends involving searching in parameter space (rather than action space) and using reward-weighted averaging (rather than gradient estimation), reinforcement learning algorithms for policy improvement, e.g. PoWER and PI2, are now able to learn sophisticated high-dimensional robot skills. A side-effect of these trends has been that, over the last 15 years, reinforcement learning (RL) algorithms have become more and more similar to evolution strategies such as (μW, λ)-ES and CMA-ES. Evolution strategies treat policy improvement as a black-box optimization problem, and thus do not leverage the problem structure, whereas RL algorithms do. In this paper, we demonstrate how two straightforward simplifications to the state-of-the-art RL algorithm PI2 suffice to convert it into the black-box optimization algorithm (μW, λ)-ES. Furthermore, we show that (μW, λ)-ES empirically outperforms PI2 on the tasks in [36]. It is striking that PI2 and (μW, λ)-ES share a common core, and that the simpler algorithm converges faster and leads to similar or lower final costs. We argue that this difference is due to a third trend in robot skill learning: the predominant use of dynamic movement primitives (DMPs). We show how DMPs dramatically simplify the learning problem, and discuss the implications of this for past and future work on policy improvement for robot skill learning.
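A minimal sketch of the shared core described in the abstract (perturb the policy parameters, evaluate each perturbation, and update the parameters by reward-weighted averaging) is given below, assuming a NumPy setting. The cost function and the constants lam, sigma, and h are illustrative placeholders, not the paper's reference implementation; refinements such as covariance matrix adaptation in CMA-ES or per-time-step weighting in PI2 are deliberately omitted.

```python
import numpy as np

def rollout_cost(theta):
    """Placeholder cost: a real setup would execute a DMP parameterized by
    theta on the robot or in simulation and return the trajectory cost."""
    return float(np.sum(theta ** 2))

def reward_weighted_averaging(theta_init, n_updates=50, lam=20, sigma=0.1, h=10.0):
    theta = np.array(theta_init, dtype=float)
    rng = np.random.default_rng(0)
    for _ in range(n_updates):
        # 1. Explore in parameter space: sample lam perturbations of the current policy.
        samples = theta + sigma * rng.standard_normal((lam, theta.size))
        costs = np.array([rollout_cost(s) for s in samples])
        # 2. Map costs to weights: exponentiate normalized costs so that
        #    low-cost samples dominate (no gradient estimation involved).
        c_min, c_max = costs.min(), costs.max()
        weights = np.exp(-h * (costs - c_min) / (c_max - c_min + 1e-12))
        weights /= weights.sum()
        # 3. Reward-weighted averaging: the new parameter vector is the
        #    weighted mean of the sampled parameter vectors.
        theta = weights @ samples
    return theta

if __name__ == "__main__":
    print(reward_weighted_averaging(np.ones(10)))
```

Because the update uses only cost-based weights and no gradient estimate, the same loop can be read either as a simplified PI2-style update (exponentiated costs) or, if the weights were replaced by rank-based weights over the best μ samples, as the mean update of a (μW, λ)-ES.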
Pages: 49-61
Number of pages: 12
References
41 items in total
  • [1] Arnold L., Auger A., Hansen N., Ollivier Y., Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles, (2011)
  • [2] Barto A., Mahadevan S., Recent advances in hierarchical reinforcement learning, Discrete Event Dynamic Systems, 13, 1-2, pp. 41-77, (2003)
  • [3] Beyer H.-G., Schwefel H.-P., Evolution strategies - A comprehensive introduction, Natural Computing, 1, 1, pp. 3-52, (2002)
  • [4] Busoniu L., Ernst D., De Schutter B., Babuska R., Cross-entropy optimization of control policies with adaptive basis functions, IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, 41, 1, pp. 196-209, (2011)
  • [5] Gomez F., Schmidhuber J., Miikkulainen R., Accelerated neural evolution through cooperatively coevolved synapses, Journal of Machine Learning Research, 9, pp. 937-965, (2008)
  • [6] Hansen N., Ostermeier A., Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation, 9, 2, pp. 159-195, (2001)
  • [7] Hansen N., The CMA Evolution Strategy: A Tutorial, (2011)
  • [8] Heidrich-Meisner V., Igel C., Evolution strategies for direct policy search, Proceedings of the 10th International Conference on Parallel Problem Solving from Nature: PPSN X, pp. 428-437, (2008)
  • [9] Heidrich-Meisner V., Igel C., Similarities and differences between policy gradient methods and evolution strategies, ESANN 2008, 16th European Symposium on Artificial Neural Networks, Bruges, Belgium, April 23-25, 2008, Proceedings, pp. 149-154, (2008)
  • [10] Ijspeert A., Nakanishi J., Pastor P., Hoffmann H., Schaal S., Dynamical Movement Primitives: Learning attractor models for motor behaviors, Neural Computation, 25, 2, pp. 328-373, (2013)