CoopFL: Accelerating federated learning with DNN partitioning and offloading in heterogeneous edge computing

Cited by: 17
Authors
Wang, Zhiyuan [1 ]
Xu, Hongli [1 ]
Xu, Yang [1 ]
Jiang, Zhida [1 ]
Liu, Jianchun [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei, Peoples R China
Funding
National Science Foundation (USA);
Keywords
Federated learning; Model partitioning; Offloading; Edge computing; SYSTEMS; CLOUD;
DOI
10.1016/j.comnet.2022.109490
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Federated learning (FL), a novel distributed machine learning (DML) approach, has been widely adopted to train deep neural networks (DNNs) over massive data in edge computing. However, existing FL systems often suffer from long training times due to resource limitations and system heterogeneity (e.g., in computing, communication, and memory) in edge computing. To this end, we design and implement an FL system, called CoopFL, which trains DNNs through cooperation between devices and edge servers. Specifically, we implement DNN partitioning and offloading techniques in CoopFL, which enable each device to train a partial DNN model and offload the intermediate data output by some hidden layers to suitable edge servers for cooperative training. However, the empirical partitioning and offloading strategies adopted in previous works may not exploit system resources well and may even slow down training. We therefore formulate the partitioning and offloading problem under resource constraints and system heterogeneity, and propose an efficient algorithm that accelerates training through the resulting DNN partitioning and offloading strategy. Extensive experiments on classical models and datasets demonstrate the effectiveness of our system. For example, CoopFL achieves a speedup of 2.3-4.9x over the baselines, including a hierarchical federated learning system (HFL), a typical federated learning system (TFL), and two systems with empirical DNN partitioning, i.e., FedMEC and HFLP.
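To make the partition-and-offload idea described in the abstract concrete, the following is a minimal sketch of layer-level DNN partitioning between a device and an edge server. It is not the paper's implementation: the layer stack, the split point SPLIT_INDEX, and the in-process "offload" (a plain tensor handoff standing in for a network transfer) are all illustrative assumptions; CoopFL additionally chooses split points and server assignments with its optimization algorithm.

```python
# Sketch of DNN partitioning: the device runs the layers before the split
# point, "offloads" the intermediate activation to an edge server, and the
# server runs the remaining layers. All names and values are illustrative.
import torch
import torch.nn as nn

# Full model expressed as an ordered list of layers so it can be cut anywhere.
layers = [
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 28 * 28, 10),
]

SPLIT_INDEX = 2  # hypothetical partition point chosen by an offloading strategy

device_part = nn.Sequential(*layers[:SPLIT_INDEX])   # runs on the end device
server_part = nn.Sequential(*layers[SPLIT_INDEX:])   # runs on the edge server

opt_device = torch.optim.SGD(device_part.parameters(), lr=0.01)
opt_server = torch.optim.SGD(server_part.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 1, 28, 28)           # a toy local mini-batch
y = torch.randint(0, 10, (8,))

# Device-side forward pass up to the split point.
activation = device_part(x)

# "Offload": in a real system this tensor would be serialized and sent over
# the network; detaching it here mimics crossing a process boundary.
act_remote = activation.detach().requires_grad_(True)

# Server-side forward pass, loss, and backward pass on the offloaded layers.
loss = criterion(server_part(act_remote), y)
loss.backward()
opt_server.step()

# The gradient w.r.t. the intermediate activation is returned to the device,
# which finishes backpropagation through its local layers.
activation.backward(act_remote.grad)
opt_device.step()

print(f"loss = {loss.item():.4f}")
```

In an actual FL round, each device would repeat this cooperative forward/backward exchange over its local data before the updated model pieces are aggregated across participants.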
Pages: 17