Joint Dynamic Grouping and Gradient Coding for Time-Critical Distributed Machine Learning in Heterogeneous Edge Networks

Cited by: 1
Authors
Mao, Yingchi [1 ,2 ]
Wu, Jun [2 ]
He, Xiaoming [2 ]
Ping, Ping [1 ,2 ]
Wang, Jiajun [2 ]
Wu, Jie [3 ]
Affiliations
[1] Minist Water Resources, Key Lab Water Big Data Technol, Nanjing 210098, Peoples R China
[2] Hohai Univ, Coll Comp & Informat, Nanjing 211100, Peoples R China
[3] Temple Univ, Ctr Networked Comp, Philadelphia, PA 19122 USA
Funding
National Natural Science Foundation of China;
Keywords
Training; Encoding; Computational modeling; Delays; Convergence; Servers; Data models; Dynamic grouping; gradient coding (GC); gradient compression; heterogeneous edge networks; stragglers
DOI
10.1109/JIOT.2022.3182394
Chinese Library Classification (CLC)
TP [automation technology, computer technology];
Subject classification code
0812;
Abstract
In edge networks, distributed computing resources are widely used by multiple nodes to collaboratively perform a machine learning task. However, model training in heterogeneous edge networks is slowed by the excessive computation and delay caused by slow nodes, namely, stragglers. Worse, the parameter server abandons stragglers that fail to return results within a reasonable deadline, a phenomenon called straggler dropout, which decreases model accuracy. To optimize the computation cost while maintaining model accuracy, we focus on mitigating the heavy computation load on stragglers and preventing straggler dropout. We therefore propose a novel scheme, dynamic grouping and heterogeneity-aware gradient coding (DGH-GC), which tolerates stragglers by combining dynamic grouping with gradient coding: DGH-GC distributes stragglers evenly across groups and encodes gradients according to each node's computation capacity to keep stragglers from dropping out. However, because DGH-GC tolerates stragglers by duplicating data, it increases the communication burden. Building on this scheme, we further propose an algorithm, DGH-(GC)², that compresses the transferred gradients in both upstream and downstream communication. Experimental evaluations show that DGH-GC outperforms state-of-the-art methods, and that DGH-(GC)² further speeds up model convergence, reducing the average iteration time by about 26% compared with DGH-GC.
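To make the straggler-tolerance idea concrete, below is a minimal Python sketch of gradient coding in the spirit of the scheme this paper builds on (the cyclic-repetition construction of Tandon et al., "Gradient Coding", ICML 2017), plus a top-k sparsifier of the kind applied to up/downstream traffic. The encoding matrix, the decode and top_k_sparsify helpers, and all parameters are illustrative assumptions for a toy setting, not the paper's DGH-GC or DGH-(GC)² implementation.

```python
# Toy gradient coding: k = 3 data partitions, n = 3 workers, tolerate s = 1
# straggler. Each worker stores s + 1 = 2 partitions, so the full gradient is
# recoverable from any n - s = 2 workers. Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)

k, dim = 3, 4
partial_grads = rng.normal(size=(k, dim))   # g1, g2, g3, one per partition
full_grad = partial_grads.sum(axis=0)       # what the parameter server wants

# Encoding matrix B (n x k): row i is the linear combination worker i sends.
B = np.array([[0.5, 1.0,  0.0],   # worker 1 holds partitions {1, 2}
              [0.0, 1.0, -1.0],   # worker 2 holds partitions {2, 3}
              [0.5, 0.0,  1.0]])  # worker 3 holds partitions {1, 3}

coded = B @ partial_grads                   # what each worker transmits

def decode(coded, B, alive):
    """Recover the gradient sum from any n - s surviving workers.

    Solves a^T B[alive] = 1^T, so that a^T coded[alive] = sum_i g_i.
    """
    a, *_ = np.linalg.lstsq(B[alive].T, np.ones(B.shape[1]), rcond=None)
    return a @ coded[alive]

def top_k_sparsify(g, ratio=0.25):
    """Keep only the largest-magnitude entries: a common gradient-compression
    primitive of the kind DGH-(GC)² applies to upstream and downstream
    traffic (the paper's exact compressor may differ)."""
    k_keep = max(1, int(ratio * g.size))
    idx = np.argpartition(np.abs(g), -k_keep)[-k_keep:]
    sparse = np.zeros_like(g)
    sparse[idx] = g[idx]
    return sparse

# Worker 2 straggles; the server decodes from workers 1 and 3 alone.
recovered = decode(coded, B, alive=[0, 2])
assert np.allclose(recovered, full_grad)
print("full gradient recovered despite one straggler")
```

Note that this toy example fixes the data replication and encoding in advance; DGH-GC additionally balances stragglers across groups and adapts the encoding to each node's measured computation capacity, which the sketch omits.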
Pages: 22723-22736 (14 pages)