On the sample complexity of actor-critic method for reinforcement learning with function approximation

Cited by: 0
Authors
Harshat Kumar
Alec Koppel
Alejandro Ribeiro
Affiliations
[1] The University of Pennsylvania, Department of Electrical and Systems Engineering
[2] JPMorgan AI Research
Source
Machine Learning | 2023, Vol. 112
Keywords
Actor-critic; Reinforcement learning; Markov decision process; Non-convex optimization; Stochastic programming
Abstract
Reinforcement learning, mathematically described by Markov decision processes, may be approached either through dynamic programming or through policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps that estimate the value function and policy gradient updates. Because these updates exhibit correlated noise and biased gradients, only the asymptotic behavior of actor-critic is known, established by connecting the iterates to an underlying dynamical system. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, resulting in a controllable bias that depends on the number of critic evaluations. As a result, we provide for the first time the convergence rate of actor-critic algorithms whose policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to that of stochastic gradient methods for non-convex problems, or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation of the value function. We then specialize these conceptual results to the cases where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference learning. These rates are then corroborated on a navigation problem involving an obstacle and on the pendulum problem, which provide insight into the interplay between optimization and generalization in reinforcement learning.
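The abstract describes an actor-critic scheme that alternates value-function (critic) estimation under linear function approximation with policy-gradient (actor) updates driven by Monte Carlo rollouts. The Python sketch below is only a minimal illustration of that general structure, not the paper's algorithm: the toy chain MDP, one-hot features, softmax policy, step sizes, fixed rollout horizon, and the use of the critic purely as a baseline for the Monte Carlo returns are all assumptions made for this example.

```python
# Minimal actor-critic sketch (illustrative assumptions throughout, not the paper's method):
# a TD(0) critic with linear value-function approximation, and an actor updated by a
# policy-gradient step computed from a Monte Carlo rollout with the critic as baseline.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, GAMMA = 6, 2, 0.95
H = 25                                         # rollout horizon (assumed)
ALPHA_ACTOR, ALPHA_CRITIC = 0.05, 0.1          # step sizes (assumed)

def features(s):
    """One-hot state features for the linear critic V(s) = w @ features(s)."""
    phi = np.zeros(N_STATES)
    phi[s] = 1.0
    return phi

def step(s, a):
    """Toy chain MDP: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == N_STATES - 1)

def policy_probs(theta, s):
    """Softmax policy with logits linear in the state features."""
    logits = theta @ features(s)               # shape (N_ACTIONS,)
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.zeros((N_ACTIONS, N_STATES))        # actor parameters
w = np.zeros(N_STATES)                         # critic (value function) weights

for k in range(300):
    # Critic phase: TD(0) updates of the linear value estimate along one rollout.
    s = 0
    for _ in range(H):
        a = rng.choice(N_ACTIONS, p=policy_probs(theta, s))
        s_next, r = step(s, a)
        td_error = r + GAMMA * w @ features(s_next) - w @ features(s)
        w += ALPHA_CRITIC * td_error * features(s)
        s = s_next

    # Actor phase: Monte Carlo rollout, then a policy-gradient step that uses the
    # sampled discounted returns with the critic value as a baseline.
    s, traj = 0, []
    for _ in range(H):
        probs = policy_probs(theta, s)
        a = rng.choice(N_ACTIONS, p=probs)
        s_next, r = step(s, a)
        traj.append((s, a, r))
        s = s_next

    grad, G = np.zeros_like(theta), 0.0
    for t in reversed(range(H)):
        s_t, a_t, r_t = traj[t]
        G = r_t + GAMMA * G                    # Monte Carlo return from step t
        adv = G - w @ features(s_t)            # critic used as a variance-reducing baseline
        probs = policy_probs(theta, s_t)
        grad_log = -np.outer(probs, features(s_t))   # gradient of log softmax policy
        grad_log[a_t] += features(s_t)
        grad += (GAMMA ** t) * adv * grad_log
    theta += ALPHA_ACTOR * grad

print("Learned policy (P[right] per state):",
      np.round([policy_probs(theta, s)[1] for s in range(N_STATES)], 2))
```

In the paper's variant, the rollout length and the number of critic evaluations are the quantities that control the bias of the policy gradient estimate and hence the resulting sample complexity; the sketch above fixes both for simplicity.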
Pages: 2433-2467 (34 pages)