Communication-Efficient Policy Gradient Methods for Distributed Reinforcement Learning

Cited by: 26
Authors
Chen, Tianyi [1 ]
Zhang, Kaiqing [2 ]
Giannakis, Georgios B. [3 ]
Basar, Tamer [2 ]
Affiliations
[1] Rensselaer Polytech Inst, Dept Elect Comp & Syst Engn, Troy, NY 12180 USA
[2] Univ Illinois, Dept Elect & Comp Engn, Urbana, IL 61801 USA
[3] Univ Minnesota, Digital Technol Ctr, Dept Elect & Comp Engn, Minneapolis, MN 55455 USA
Source
IEEE TRANSACTIONS ON CONTROL OF NETWORK SYSTEMS | 2022, Vol. 9, No. 2
Funding
U.S. National Science Foundation
Keywords
Communication-efficient learning; distributed learning; multiagent; policy gradient; reinforcement learning;
DOI
10.1109/TCNS.2021.3078100
Chinese Library Classification (CLC): TP [Automation technology; computer technology]
Discipline code: 0812
Abstract
This article deals with distributed policy optimization in reinforcement learning, which involves a central controller and a group of learners. In particular, two typical settings encountered in several applications are considered: multiagent reinforcement learning (RL) and parallel RL, where frequent information exchanges between the learners and the controller are required. For many practical distributed systems, however, the overhead caused by these frequent exchanges is considerable and becomes the bottleneck of overall performance. To address this challenge, a novel policy gradient approach is developed for solving distributed RL. The new approach adaptively skips policy gradient communication during iterations, reducing the communication overhead without degrading learning performance. It is established analytically that: i) the new algorithm has a convergence rate identical to that of the plain-vanilla policy gradient; and ii) if the distributed learners are heterogeneous in their reward functions, the number of communication rounds needed to achieve a desired learning accuracy is markedly reduced. Numerical experiments corroborate the communication reduction attained by the new algorithm relative to alternatives.
Pages: 917-929
Page count: 13
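
The adaptive-skipping idea summarized in the abstract can be pictured with a short sketch. The snippet below is a minimal Python illustration, not the paper's exact method: the paper derives an adaptive trigger from recent parameter differences, whereas this sketch uses a fixed threshold on how much each learner's gradient has drifted since its last upload. The names local_grad_fns, threshold, and step_size are illustrative assumptions.

```python
import numpy as np

def distributed_pg_with_skipping(local_grad_fns, theta0, num_iters=100,
                                 step_size=0.01, threshold=1e-3):
    """Sketch: central controller runs gradient ascent while each learner
    skips uploads whose gradient has barely changed since its last upload.

    local_grad_fns: one callable per learner, mapping theta to that
    learner's local policy gradient estimate (hypothetical stand-ins).
    """
    theta = np.asarray(theta0, dtype=float).copy()
    # Last gradient each learner actually uploaded; the controller reuses
    # this stale copy whenever the learner skips communication.
    stale = [fn(theta) for fn in local_grad_fns]  # initial round: all communicate
    comm_rounds = len(local_grad_fns)

    for _ in range(num_iters):
        for m, fn in enumerate(local_grad_fns):
            fresh = fn(theta)
            # Upload only if the local gradient moved enough; otherwise
            # the controller keeps aggregating the stale copy.
            if np.linalg.norm(fresh - stale[m]) >= threshold:
                stale[m] = fresh
                comm_rounds += 1
        # Controller aggregates the (possibly stale) gradients and ascends.
        theta += step_size * np.mean(stale, axis=0)

    return theta, comm_rounds
```

As a toy usage, feeding in quadratic surrogates such as lambda th, A=A_m, b=b_m: A @ th - b (hypothetical stand-ins for each learner's gradient estimator) shows learners whose local landscape changes slowly quickly stop uploading, which is consistent with the abstract's claim that the savings are largest when learners have heterogeneous reward functions.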