MEC-A Near-Optimal Online Reinforcement Learning Algorithm for Continuous Deterministic Systems

被引：63

作者：

Zhao, Dongbin ^{[1
]}

Zhu, Yuanheng ^{[1
]}

机构：

[1] Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China

来源：

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS | 2015年 / 26卷 / 02期

基金：

中国国家自然科学基金;

关键词：

Efficient exploration; probably approximately correct (PAC); reinforcement learning (RL); state aggregation; TIME NONLINEAR-SYSTEMS; MODEL-BASED EXPLORATION; ZERO-SUM GAMES; CONTROL SCHEME;

D O I：

10.1109/TNNLS.2014.2371046

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

In this paper, the first probably approximately correct (PAC) algorithm for continuous deterministic systems without relying on any system dynamics is proposed. It combines the state aggregation technique and the efficient exploration principle, and makes high utilization of online observed samples. We use a grid to partition the continuous state space into different cells to save samples. A near-upper Q operator is defined to produce a near-upper Q function using samples in each cell. The corresponding greedy policy effectively balances between exploration and exploitation. With the rigorous analysis, we prove that there is a polynomial time bound of executing nonoptimal actions in our algorithm. After finite steps, the final policy reaches near optimal in the framework of PAC. The implementation requires no knowledge of systems and has less computation complexity. Simulation studies confirm that it is a better performance than other similar PAC algorithms.

引用

页码：346 / 356

页数：11

共 42 条

[1]

[Anonymous], 2009, Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems

[2]

[Anonymous], 1998, Reinforcement Learning: An Introduction

[3]

Bai XR, 2009, INT J INNOV COMPUT I, V5, P3471

[4] Adaptive-resolution reinforcement learning with polynomial exploration in deterministic domains [J].

Bernstein, Andrey ;

Shimkin, Nahum .

MACHINE LEARNING, 2010, 81 (03) :359-397

[5] R-MAX - A general polynomial time algorithm for near-optimal reinforcement learning [J].

Brafman, RI ;

Tennenholtz, M .

JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (02) :213-231

[6]

Busoniu L, 2010, AUTOM CONTROL ENG SE, P1, DOI 10.1201/9781439821091-f

[7] Online Selective Kernel-Based Temporal Difference Learning [J].

Chen, Xingguo ;

Gao, Yang ;

Wang, Ruili .

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2013, 24 (12) :1944-1956

[8]

Grondman I., 2013, 2013 INT JOINT C NEU, P1

[9]

Jong NK, 2007, LECT NOTES ARTIF INT, V4612, P258

[10]

Jung T, 2010, LECT NOTES ARTIF INT, V6321, P601, DOI 10.1007/978-3-642-15880-3_44

← 1 2 3 4 5 →