Model selection in reinforcement learning

Cited by: 0
Authors
Amir-massoud Farahmand
Csaba Szepesvári
Affiliation
[1] University of Alberta, Department of Computing Science
Source
Machine Learning | 2011 / Volume 85
Keywords
Reinforcement learning; Model selection; Complexity regularization; Adaptivity; Offline learning; Off-policy learning; Finite-sample bounds;
DOI
Not available
Abstract
We consider the problem of model selection in the batch (offline, non-interactive) reinforcement learning setting, where the goal is to find an action-value function with the smallest Bellman error among a countable set of candidate functions. We propose a complexity regularization-based model selection algorithm, BErMin, and prove that it enjoys an oracle-like property: the estimator's error differs from that of an oracle, who selects the candidate with the minimum Bellman error, by only a constant factor and a small remainder term that vanishes at a parametric rate as the number of samples increases. As an application, we consider the problem where the true action-value function belongs to an unknown member of a nested sequence of function spaces. We show that under some additional technical conditions BErMin leads to a procedure whose rate of convergence matches, up to a constant factor, that of an oracle who knows which of the nested function spaces the true action-value function belongs to, i.e., the procedure achieves adaptivity.
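The abstract describes the classical complexity-regularization selection rule: among candidate action-value functions, pick the one whose estimated Bellman error plus a complexity penalty is smallest, where the penalty vanishes at a parametric rate in the sample size. The sketch below illustrates only this generic rule, not the BErMin procedure itself (which estimates Bellman errors via a separate regression step and uses a carefully calibrated penalty); the function name select_model, the penalty form c * complexity / n, and the example numbers are all hypothetical.

```python
import numpy as np

def select_model(bellman_error_estimates, complexities, n, c=1.0):
    """Generic complexity-regularization selection rule (a sketch).

    Selects the candidate minimizing (estimated Bellman error + penalty),
    where the penalty shrinks at a parametric O(complexity / n) rate.

    This is NOT the BErMin algorithm from the paper; it only shows the
    "error estimate + complexity penalty" rule that BErMin refines.
    """
    errors = np.asarray(bellman_error_estimates, dtype=float)
    penalties = c * np.asarray(complexities, dtype=float) / n
    return int(np.argmin(errors + penalties))

# Hypothetical example: three candidate action-value functions with
# estimated Bellman errors and complexity measures, from n = 500 samples.
k_star = select_model([0.30, 0.12, 0.11], complexities=[1.0, 4.0, 64.0], n=500)
print(k_star)  # 1 -- candidate 2's tiny error gain does not pay for its penalty
```

Note the trade-off the rule encodes: the third candidate has the lowest raw error estimate, but its large complexity penalty makes the simpler second candidate the selected model, which is exactly the overfitting guard that complexity regularization provides.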
Pages: 299–332
Number of pages: 33