Continuous action reinforcement learning for control-affine systems with unknown dynamics

Cited by: 14
Authors
Faust, Aleksandra [1 ]
Ruymgaart, Peter [1 ]
Salman, Molly [2 ]
Fierro, Rafael [3 ]
Tapia, Lydia [1 ]
Affiliations
[1] Department of Computer Science, University of New Mexico, Albuquerque, NM 87131
[2] Computer Science Department, Austin College, Sherman, TX 75090
[3] Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131
Funding
US National Institutes of Health; US National Science Foundation
Keywords
approximate value iteration; continuous action spaces; control-affine nonlinear systems; fitted value iteration; policy approximation; reinforcement learning
DOI
10.1109/JAS.2014.7004690
Abstract
Controlling nonlinear systems in real time is challenging: decisions made many times per second must keep the system safe. Designing an input to perform a task often requires solving a nonlinear system of differential equations, a computationally intensive, if not intractable, problem. This article proposes sampling-based task learning for control-affine nonlinear systems through the combined learning of state- and action-value functions in a model-free approximate value iteration setting with continuous inputs. A quadratic negative definite state-value function implies that the action-value function has a unique maximum at every state. This allows the standard greedy policy to be replaced with a computationally efficient policy approximation that guarantees progression toward a goal state without knowledge of the system dynamics. The policy approximation is consistent, i.e., it does not depend on the action samples used to compute it. The method is suited to mechanical systems with high-dimensional input spaces and unknown dynamics performing Constraint-Balancing Tasks. We verify it both in simulation and experimentally for an unmanned aerial vehicle (UAV) carrying a suspended load, and in simulation for the rendezvous of heterogeneous robots. © 2014 Chinese Association of Automation.
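The policy approximation described in the abstract lends itself to a compact illustration. The Python sketch below is not the paper's implementation; `q_fn`, `a_min`, `a_max`, and the per-axis decomposition are illustrative assumptions. It fits a one-dimensional quadratic through three action-value samples along each input axis and takes the analytic maximizer, which is well defined when the action-value function is concave in the action, as a quadratic negative definite state-value function implies for control-affine dynamics.

```python
import numpy as np

def approx_greedy_action(q_fn, state, a_min, a_max):
    """Minimal sketch of a quadratic policy approximation.

    Assumes q_fn(state, action) is (close to) concave quadratic in each
    action coordinate. All names here are hypothetical, not the paper's API.

    q_fn:  callable (state, action) -> float, a learned action-value estimate
    a_min, a_max: per-axis input bounds, arrays of shape (d,)
    """
    d = len(a_min)
    action = np.zeros(d)
    for i in range(d):
        # Sample Q at three actions along axis i (other coordinates held at 0).
        samples = np.linspace(a_min[i], a_max[i], 3)
        q_vals = []
        for a_i in samples:
            a = np.zeros(d)
            a[i] = a_i
            q_vals.append(q_fn(state, a))
        # Exact quadratic fit through the three samples: q = c2*a^2 + c1*a + c0.
        c2, c1, c0 = np.polyfit(samples, q_vals, 2)
        if c2 < 0.0:
            # Concave case: unique interior maximizer, clipped to the bounds.
            action[i] = np.clip(-c1 / (2.0 * c2), a_min[i], a_max[i])
        else:
            # Degenerate fit: fall back to the better boundary sample.
            action[i] = samples[0] if q_vals[0] >= q_vals[-1] else samples[-1]
    return action
```

Because three points determine a quadratic exactly, the recovered maximizer does not change with the choice of action samples whenever the action-value function really is quadratic in the input, which is the consistency property the abstract claims.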
Pages: 323-336
Page count: 13