Federated Reinforcement Learning for Training Control Policies on Multiple IoT Devices

Cited by: 54
Authors
Lim, Hyun-Kyo [1]
Kim, Ju-Bong [2]
Heo, Joo-Seong [1]
Han, Youn-Hee [2]
Affiliations
[1] Korea Univ Technol & Educ, Dept Interdisciplinary Program Creat Engn, Cheonan 31253, South Korea
[2] Korea Univ Technol & Educ, Dept Comp Sci Engn, Cheonan 31253, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Actor-Critic PPO; federated reinforcement learning; multi-device control;
DOI
10.3390/s20051359
Chinese Library Classification
O65 [Analytical Chemistry];
Subject Classification Codes
070302; 081704;
Abstract
Reinforcement learning has recently been studied in various fields and has also been used to optimally control IoT devices, supporting the expansion of Internet connectivity beyond standard devices. In this paper, we enable multiple reinforcement learning agents to learn optimal control policies on their own IoT devices of the same type but with slightly different dynamics. For such multiple IoT devices, there is no guarantee that an agent which interacts with only one IoT device and learns the optimal control policy will also control another IoT device well. We would therefore need to apply independent reinforcement learning to each IoT device individually, which is costly and time-consuming. To solve this problem, we propose a new federated reinforcement learning architecture in which each agent, working on its own independent IoT device, shares its learning experience (i.e., the gradient of the loss function) with the others and transfers mature policy model parameters to the other agents, which then accelerate their own learning by using those parameters. We incorporate the actor-critic proximal policy optimization (Actor-Critic PPO) algorithm into each agent in the proposed collaborative architecture and propose an efficient procedure for gradient sharing and model transfer. Using multiple rotary inverted pendulum devices interconnected via a network switch, we demonstrate that the proposed federated reinforcement learning scheme effectively facilitates the learning process for multiple IoT devices and that the learning speed increases as more agents are involved.
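To make the gradient-sharing and model-transfer ideas from the abstract concrete, below is a minimal sketch in PyTorch. All names here (PolicyNet, share_gradients, transfer_model), the toy observations, and the stand-in loss are illustrative assumptions, not the authors' implementation; a real setup would compute the Actor-Critic PPO surrogate loss from rollouts on each physical device.

```python
# Minimal sketch of federated gradient sharing and model transfer among
# per-device agents. Names and the toy loss are assumptions for illustration.
import copy
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Tiny actor network standing in for each agent's Actor-Critic PPO policy."""
    def __init__(self, obs_dim=4, act_dim=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))

    def forward(self, obs):
        return self.body(obs)

def share_gradients(models):
    """Average the gradient of the loss function across all agents
    (the 'learning experience' the abstract says agents exchange)."""
    for params in zip(*(m.parameters() for m in models)):
        mean_grad = torch.stack([p.grad for p in params]).mean(dim=0)
        for p in params:
            p.grad = mean_grad.clone()

def transfer_model(source, targets):
    """Copy a mature agent's policy parameters into the other agents."""
    for t in targets:
        t.load_state_dict(copy.deepcopy(source.state_dict()))

agents = [PolicyNet() for _ in range(4)]  # one agent per IoT device
opts = [torch.optim.Adam(a.parameters(), lr=3e-4) for a in agents]

for step in range(100):
    for agent, opt in zip(agents, opts):
        opt.zero_grad()
        obs = torch.randn(8, 4)             # stand-in for device rollouts
        loss = agent(obs).pow(2).mean()     # stand-in for the PPO surrogate loss
        loss.backward()
    share_gradients(agents)                 # federated gradient sharing
    for opt in opts:
        opt.step()

transfer_model(agents[0], agents[1:])       # broadcast a mature policy
```

In this sketch every agent computes its own gradient locally, the gradients are averaged and written back before each optimizer step, and once one policy is deemed mature its parameters are broadcast so the remaining agents can continue learning from a stronger starting point.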
Pages: 15