Finite-Time Analysis of Distributed TD(0) with Linear Function Approximation for Multi-Agent Reinforcement Learning

Cited by: 0
Authors
Doan, Thinh T. [1 ,2 ]
Maguluri, Siva Theja [1 ]
Romberg, Justin [2 ]
Affiliations
[1] Georgia Inst Technol, Sch Ind & Syst Engn, Atlanta, GA 30332 USA
[2] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
Source
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, 2019, Vol. 97
Keywords
POLICY EVALUATION; NETWORKS; ALGORITHMS
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
We study the policy evaluation problem in multi-agent reinforcement learning, where a group of agents works cooperatively to evaluate the value function of the global discounted cumulative reward, which is composed of local rewards observed by the agents. Over a series of time steps, the agents act, receive rewards, update their local estimates of the value function, and then communicate with their neighbors. The local update at each agent can be interpreted as a distributed, consensus-based variant of the popular temporal-difference learning algorithm TD(0). While distributed reinforcement learning algorithms have been presented in the literature, almost nothing is known about their convergence rates. Our main contribution is a finite-time analysis of the convergence of the distributed TD(0) algorithm when the communication network between the agents is, in general, time-varying. We obtain an explicit upper bound on the rate of convergence of this algorithm as a function of the network topology and the discount factor. Our results mirror what one would expect from using distributed stochastic gradient descent to solve convex optimization problems.
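The update described in the abstract can be illustrated with a short sketch. The Python code below is a minimal, hypothetical rendering of one step of consensus-based TD(0) with linear value approximation, assuming a synchronous update, a state observed by all agents, and a doubly stochastic weight matrix W encoding the (possibly time-varying) communication network at that step; the function name distributed_td0_step and its signature are illustrative and not taken from the paper.

import numpy as np

def distributed_td0_step(theta, W, phi_s, phi_next, rewards, alpha, gamma):
    # Hypothetical one-step sketch of consensus-based TD(0), not the paper's exact algorithm.
    # theta    : (N, d) array; row i is agent i's parameter vector
    # W        : (N, N) doubly stochastic consensus weight matrix for this step
    # phi_s    : (d,) feature vector of the current (shared) state
    # phi_next : (d,) feature vector of the next state
    # rewards  : (N,) local rewards observed by the agents
    # alpha    : step size; gamma : discount factor in (0, 1)
    mixed = W @ theta                                      # consensus: mix neighbors' parameters
    td_errors = rewards + gamma * (theta @ phi_next) - theta @ phi_s   # (N,) local TD(0) errors
    return mixed + alpha * np.outer(td_errors, phi_s)      # (N, d) updated local parameters

In this sketch each agent first averages its parameters with its neighbors' and then takes a local TD(0) step using only its own reward; iterating such a step with an appropriate step size is the kind of scheme whose finite-time convergence the paper analyzes.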
Pages: 10