Model-Free Trajectory-based Policy Optimization with Monotonic Improvement

Cited: 0
Authors
Akrour, Riad [1 ]
Abdolmaleki, Abbas [2 ]
Abdulsamad, Hany [1 ]
Peters, Jan [1 ,3 ]
Neumann, Gerhard [1 ,4 ]
Affiliations
[1] Tech Univ Darmstadt, CLAS IAS, Hsch Str 10, D-64289 Darmstadt, Germany
[2] DeepMind, London N1C 4AG, England
[3] Max Planck Inst Intelligent Syst, Max Planck Ring 4, Tubingen, Germany
[4] Univ Lincoln, L CAS, Lincoln LN6 7TS, England
Funding
EU Horizon 2020;
Keywords
Reinforcement Learning; Policy Optimization; Trajectory Optimization; Robotics;
DOI
None available
Chinese Library Classification (CLC)
TP [automation and computer technology];
Discipline code
0812;
Abstract
Many recent trajectory optimization algorithms alternate between a linear approximation of the system dynamics around the mean trajectory and a conservative policy update. One way of constraining the policy change is to bound the Kullback-Leibler (KL) divergence between successive policies. These approaches have already demonstrated great experimental success on challenging problems such as end-to-end control of physical systems. However, the linear approximation of the system dynamics can introduce a bias in the policy update and prevent convergence to the optimal policy. In this article, we propose a new model-free trajectory-based policy optimization algorithm with guaranteed monotonic improvement. Instead of a model of the system dynamics, the algorithm backpropagates a local, quadratic and time-dependent Q-function learned from trajectory data. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics. We experimentally demonstrate on highly non-linear control tasks that our algorithm improves on approaches that linearize the system dynamics. To show the monotonic improvement of our algorithm, we additionally conduct a theoretical analysis of our policy update scheme and derive a lower bound on the change in policy return between successive iterations.
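The kind of update the abstract describes — a Gaussian policy reweighted by a local quadratic Q-function under an exact KL bound — admits a closed form that can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the hand-picked quadratic model Q(a) = -0.5 aᵀAa + bᵀa, and the log-space bisection on the temperature are all assumptions made for the example.

```python
import numpy as np

def kl_gauss(mu1, S1, mu0, S0):
    """KL( N(mu1, S1) || N(mu0, S0) ) between multivariate Gaussians."""
    d = len(mu0)
    P0 = np.linalg.inv(S0)
    diff = mu0 - mu1
    return 0.5 * (np.trace(P0 @ S1) + diff @ P0 @ diff - d
                  + np.log(np.linalg.det(S0) / np.linalg.det(S1)))

def reweighted_policy(mu, Sigma, A, b, eta):
    """Closed-form Gaussian update pi_new proportional to pi_old * exp(Q/eta)
    for the quadratic model Q(a) = -0.5 a^T A a + b^T a."""
    P = np.linalg.inv(Sigma)
    P_new = P + A / eta                  # precisions add under the product
    S_new = np.linalg.inv(P_new)
    mu_new = S_new @ (P @ mu + b / eta)  # linear terms add likewise
    return mu_new, S_new

def kl_constrained_update(mu, Sigma, A, b, eps, iters=60):
    """Bisect the temperature eta so that KL(pi_new || pi_old) <= eps holds
    exactly at convergence (KL is monotonically decreasing in eta)."""
    lo, hi = 1e-6, 1e6
    for _ in range(iters):
        eta = np.sqrt(lo * hi)           # bisection in log-space
        mu_n, S_n = reweighted_policy(mu, Sigma, A, b, eta)
        if kl_gauss(mu_n, S_n, mu, Sigma) > eps:
            lo = eta                     # step too greedy: raise eta
        else:
            hi = eta                     # constraint slack: lower eta
    return reweighted_policy(mu, Sigma, A, b, hi)
```

Because the old policy itself is feasible under the KL constraint, the constrained optimizer can only raise the expected Q-value, which is the intuition behind the monotonic-improvement guarantee the abstract refers to.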
Pages: 25