Distributed consensus-based multi-agent temporal-difference learning

Cited by: 5
Authors
Stankovic, Milos S. [1,2]
Beko, Marko [3,5]
Stankovic, Srdjan S. [4,5]
Affiliations
[1] Singidunum Univ, Belgrade, Serbia
[2] Vlatacom Inst, Belgrade, Serbia
[3] Univ Lisbon, Inst Telecomunicacoes, Inst Super Tecn, Lisbon, Portugal
[4] Univ Belgrade, Sch Elect Engn, Belgrade, Serbia
[5] Univ Lusofona, COPELABS, Lisbon, Portugal
Keywords
REINFORCEMENT; OPTIMIZATION; NETWORKS
DOI
10.1016/j.automatica.2023.110922
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Subject classification code
0812
Abstract
In this paper we propose two new distributed consensus-based algorithms for temporal-difference learning in multi-agent Markov decision processes. The algorithms are of off-policy type and are aimed at linear approximation of the value function. Restricting agents' observations to local data and communications to their small neighborhoods, the algorithms consist of: (a) local updates of the parameter estimates based on either the standard TD(λ) or the emphatic ETD(λ) algorithm, and (b) a dynamic consensus scheme implemented over a time-varying lossy communication network. The algorithms are completely decentralized, allowing efficient parallelization and applications where the agents may have different behavior policies and different initial state distributions while evaluating a common target policy. It is proved under nonrestrictive assumptions that the proposed algorithms weakly converge to the solutions of the mean ordinary differential equation (ODE) common to all the agents. It is also proved that the whole system may be stabilized by a proper choice of the network and that the parameter estimates weakly converge to consensus. Discussion is given on the asymptotic bias and variance of the estimates, on the projected forms of the proposed algorithms, as well as on the restrictiveness of the adopted assumptions. Simulation results illustrate the main properties of the algorithms and provide comparisons with similar schemes. © 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Pages: 11
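To make the two-step structure described in the abstract concrete, below is a minimal Python sketch of one synchronous round of consensus-based off-policy TD(λ) with linear value-function approximation. It is an illustrative approximation under simplifying assumptions (a fixed row-stochastic weight matrix, synchronous updates, a generic importance-sampling ratio ρ = π(a|s)/b_i(a|s)), not the authors' exact algorithm; the function and variable names (`consensus_td_lambda_step`, `transitions`, `A`) are hypothetical.

```python
import numpy as np

def consensus_td_lambda_step(theta, traces, transitions, A, alpha, gamma, lam):
    """One synchronous round: local off-policy TD(lambda) updates followed
    by a consensus step over the communication network.

    theta       : (N, d) parameter estimates, one row per agent
    traces      : (N, d) eligibility traces, one row per agent
    transitions : N tuples (phi_s, phi_s_next, reward, rho), where
                  rho = pi(a|s) / b_i(a|s) corrects agent i's behavior
                  policy toward the common target policy
    A           : (N, N) row-stochastic consensus weights; A[i, j] > 0
                  only if agent j is in agent i's neighborhood
    """
    theta_local = np.empty_like(theta)
    for i, (phi_s, phi_next, r, rho) in enumerate(transitions):
        # Off-policy eligibility trace and TD error for agent i
        traces[i] = rho * (gamma * lam * traces[i] + phi_s)
        delta = r + gamma * theta[i] @ phi_next - theta[i] @ phi_s
        theta_local[i] = theta[i] + alpha * delta * traces[i]
    # Dynamic consensus: each agent averages its neighbors' estimates
    return A @ theta_local, traces

if __name__ == "__main__":
    # Toy run: 4 agents on a ring network, 3-dimensional features
    rng = np.random.default_rng(0)
    N, d = 4, 3
    A = 0.5 * np.eye(N) + 0.25 * (np.roll(np.eye(N), 1, axis=1)
                                  + np.roll(np.eye(N), -1, axis=1))
    theta, traces = np.zeros((N, d)), np.zeros((N, d))
    for _ in range(100):
        transitions = [(rng.standard_normal(d), rng.standard_normal(d),
                        rng.standard_normal(), 1.0) for _ in range(N)]
        theta, traces = consensus_td_lambda_step(
            theta, traces, transitions, A, alpha=0.05, gamma=0.9, lam=0.7)
```

In the paper the network may be time-varying and lossy; the fixed ring used above only keeps the example self-contained.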