A Survey on Policy Search Algorithms for Learning Robot Controllers in a Handful of Trials

Cited by: 100
Authors
Chatzilygeroudis, Konstantinos [1 ,2 ]
Vassiliades, Vassilis [1 ,3 ]
Stulp, Freek [4 ]
Calinon, Sylvain [5 ]
Mouret, Jean-Baptiste [1 ]
Affiliations
[1] Univ Lorraine, LORIA, Inria, Ctr Natl Rech Sci CNRS, F-54000 Nancy, France
[2] Ecole Polytech Fed Lausanne, Learning Algorithms & Syst Lab, CH-1015 Lausanne, Switzerland
[3] Res Ctr Interact Media Smart Syst & Emerging Tech, CY-1500 Nicosia, Cyprus
[4] German Aerosp Ctr DLR, Inst Robot & Mechatron, D-82234 Wessling, Germany
[5] Idiap Res Inst, CH-1920 Martigny, Switzerland
Funding
UK Engineering and Physical Sciences Research Council; European Research Council
Keywords
Autonomous agents; learning and adaptive systems; micro-data policy search (MDPS); robot learning; MODEL-PREDICTIVE CONTROL; BAYESIAN OPTIMIZATION; GAUSSIAN-PROCESSES; EVOLUTIONARY OPTIMIZATION; MOVEMENT PRIMITIVES; FEEDBACK-CONTROL; REINFORCEMENT; UNCERTAINTY; ADAPTATION; SELECTION;
DOI
10.1109/TRO.2019.2958211
Chinese Library Classification (CLC)
TP24 [Robotics]
Subject Classification Code
080202; 1405
Abstract
Most policy search (PS) algorithms require thousands of training episodes to find an effective policy, which is often infeasible on a physical robot. This survey focuses on the other extreme of the spectrum: how can a robot adapt with only a handful of trials (a dozen) and a few minutes? By analogy with the term "big data," we refer to this challenge as "micro-data reinforcement learning." We show that a first strategy is to leverage prior knowledge about the policy structure (e.g., dynamic movement primitives), the policy parameters (e.g., demonstrations), or the dynamics (e.g., simulators). A second strategy is to learn data-driven surrogate models of the expected reward (e.g., Bayesian optimization) or of the dynamics (e.g., model-based PS), so that the policy optimizer queries the model instead of the real system. Overall, all successful micro-data algorithms combine these two strategies by varying the kind of model and prior knowledge. The current scientific challenges essentially revolve around scaling up to complex robots, designing generic priors, and reducing the computing time.
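To make the second strategy concrete, here is a minimal sketch of micro-data policy search via Bayesian optimization, in the spirit of the survey but not taken from it: a Gaussian-process surrogate of the expected reward is queried densely by the optimizer, while the real system is queried only once per trial. The reward function simulate_episode and the 2-D policy parameterization are hypothetical stand-ins for a real robot rollout.

```python
# Sketch: Bayesian optimization as a surrogate of the expected reward.
# Assumes an episodic, black-box reward; all names below are illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def simulate_episode(theta):
    # Hypothetical stand-in for one trial on the physical robot.
    return -np.sum((theta - 0.3) ** 2)

rng = np.random.default_rng(0)
dim, n_init, n_trials = 2, 3, 12            # "a handful of trials"
X = rng.uniform(-1, 1, size=(n_init, dim))  # initial random policies
y = np.array([simulate_episode(x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(n_trials - n_init):
    gp.fit(X, y)
    # Upper-confidence-bound acquisition over random candidates:
    # the surrogate, not the robot, is evaluated 2000 times here.
    cand = rng.uniform(-1, 1, size=(2000, dim))
    mu, sigma = gp.predict(cand, return_std=True)
    theta = cand[np.argmax(mu + 2.0 * sigma)]
    X = np.vstack([X, theta])
    y = np.append(y, simulate_episode(theta))  # one real trial per iteration

print("best policy:", X[np.argmax(y)], "reward:", y.max())
```

Replacing the reward surrogate with a learned dynamics model, and rolling the policy out inside that model, would give the model-based PS variant of the same loop.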
Pages: 328-347
Page count: 20