Distributed Value Function Approximation for Collaborative Multiagent Reinforcement Learning

Cited by: 11
Authors
Stankovic, Milos S. [1 ,2 ]
Beko, Marko [3 ]
Stankovic, Srdjan S. [4 ,5 ]
Affiliations
[1] Vlatacom Inst, Belgrade 11070, Serbia
[2] Singidunum Univ, Belgrade 11000, Serbia
[3] Univ Lisbon, Inst Telecomunicacoes, Inst Super Tecn, P-1649004 Lisbon, Portugal
[4] Univ Belgrade, Sch Elect Engn, Belgrade 11000, Serbia
[5] Univ Lusofona Humanidades & Tecnol, COPELABS, P-1749024 Lisbon, Portugal
Source
IEEE TRANSACTIONS ON CONTROL OF NETWORK SYSTEMS | 2021, Vol. 8, No. 3
Keywords
Collaborative networks; convergence analysis; decentralized algorithms; distributed consensus; multi-agent systems; reinforcement learning; temporal difference learning; value function; weak convergence; STOCHASTIC-APPROXIMATION; POLICY EVALUATION; WEAK-CONVERGENCE; OPTIMIZATION; NETWORKS;
DOI
10.1109/TCNS.2021.3061909
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
In this article, we propose several novel distributed gradient-based temporal-difference algorithms for multiagent off-policy learning of a linear approximation of the value function in Markov decision processes under strict information structure constraints that limit interagent communication to small neighborhoods. Each algorithm combines two components: first, local parameter updates based on single-agent off-policy gradient temporal-difference learning, including eligibility traces with state-dependent parameters, and, second, linear stochastic time-varying consensus schemes represented by directed graphs. The proposed algorithms differ in their form, the definition of the eligibility traces, the selection of time scales, and the way consensus iterations are incorporated. The main contribution of this article is a convergence analysis based on the general properties of the underlying Feller-Markov processes and the stochastic time-varying consensus model. We prove, under general assumptions, that the parameter estimates generated by all the proposed algorithms weakly converge to the solutions of the corresponding ordinary differential equations with precisely defined invariant sets. It is also demonstrated how the adopted methodology can be applied to temporal-difference algorithms under weaker information structure constraints. The variance reduction effect of the proposed algorithms is demonstrated by formulating and analyzing an asymptotic stochastic differential equation. Specific guidelines for communication network design are provided, and the algorithms' superior properties are illustrated by characteristic simulation results.
Pages: 1270-1280
Number of pages: 11
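The abstract above only outlines the algorithm structure: local off-policy gradient temporal-difference updates combined with a linear consensus step over a directed communication graph. The following is a minimal sketch of that general structure, not the paper's exact schemes; the GTD2-style update form, the single time scale, the fixed row-stochastic weight matrix C, the omission of eligibility traces, and the function name distributed_gtd2_step are all assumptions made for illustration.

```python
import numpy as np

def distributed_gtd2_step(theta, w, phi, phi_next, r, rho, gamma, alpha, beta, C):
    """One synchronous round of a consensus-based, GTD2-style off-policy update (illustrative sketch).

    theta, w        : (n_agents, d) local primary / auxiliary parameter estimates
    phi, phi_next   : (n_agents, d) local feature vectors for the current and next states
    r, rho          : (n_agents,) local rewards and importance-sampling ratios
    gamma           : discount factor; alpha, beta : local step sizes
    C               : (n_agents, n_agents) row-stochastic consensus weight matrix,
                      nonzero only for neighboring agents in the communication graph
    """
    # Local TD errors under the linear approximation V(s) ~= phi(s)^T theta_i.
    delta = r + gamma * np.einsum('ij,ij->i', phi_next, theta) \
              - np.einsum('ij,ij->i', phi, theta)
    # Auxiliary projections phi(s)^T w_i used by the gradient correction term.
    proj = np.einsum('ij,ij->i', phi, w)

    # Local off-policy GTD2-style updates with importance-sampling correction rho.
    theta_new = theta + alpha * (rho * proj)[:, None] * (phi - gamma * phi_next)
    w_new = w + beta * (rho * (delta - proj))[:, None] * phi

    # Linear consensus step: each agent mixes the estimates of its in-neighbors.
    return C @ theta_new, C @ w_new

# Tiny usage example with synthetic data: 3 agents, 4 features, uniform mixing weights.
rng = np.random.default_rng(0)
n, d = 3, 4
C = np.full((n, n), 1.0 / n)          # row-stochastic; a sparse neighbor-based matrix in general
theta, w = np.zeros((n, d)), np.zeros((n, d))
for _ in range(1000):
    phi, phi_next = rng.random((n, d)), rng.random((n, d))
    r, rho = rng.random(n), np.ones(n)
    theta, w = distributed_gtd2_step(theta, w, phi, phi_next, r, rho,
                                     gamma=0.9, alpha=0.01, beta=0.02, C=C)
```

In the paper the consensus matrix is time varying and the local updates include state-dependent eligibility traces and, in some variants, two time scales; the sketch fixes all of these for brevity.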