Joint Dynamic Grouping and Gradient Coding for Time-Critical Distributed Machine Learning in Heterogeneous Edge Networks

Cited by: 1
Authors
Mao, Yingchi [1 ,2 ]
Wu, Jun [2 ]
He, Xiaoming [2 ]
Ping, Ping [1 ,2 ]
Wang, Jiajun [2 ]
Wu, Jie [3 ]
Affiliations
[1] Minist Water Resources, Key Lab Water Big Data Technol, Nanjing 210098, Peoples R China
[2] Hohai Univ, Coll Comp & Informat, Nanjing 211100, Peoples R China
[3] Temple Univ, Ctr Networked Comp, Philadelphia, PA 19122 USA
Funding
National Natural Science Foundation of China;
Keywords
Training; Encoding; Computational modeling; Delays; Convergence; Servers; Data models; Dynamic grouping; gradient coding (GC); gradient compression; heterogeneous edge networks; stragglers
DOI
10.1109/JIOT.2022.3182394
Chinese Library Classification (CLC)
TP [automation technology, computer technology];
Subject classification code
0812;
Abstract
In edge networks, distributed computing resources are widely used by multiple nodes to collaboratively perform a machine learning task. However, model training in heterogeneous edge networks is slowed by the excessive computation and delay caused by slow nodes, namely, stragglers. Worse, the parameter server abandons stragglers that fail to return results within a reasonable deadline, a phenomenon called straggler dropout, which decreases model accuracy. To optimize the computation cost while maintaining model accuracy, we focus on mitigating the heavy computation load on stragglers and preventing straggler dropout. We therefore propose a novel scheme, dynamic grouping and heterogeneity-aware gradient coding (DGH-GC), which tolerates stragglers by combining dynamic grouping with gradient coding: DGH-GC distributes stragglers evenly across groups and encodes gradients according to each node's computation capacity to keep stragglers from dropping out. However, because DGH-GC tolerates stragglers by duplicating data, it increases the communication burden. Building on this scheme, we further propose an algorithm, DGH-(GC)², that compresses the transferred gradients in both upstream and downstream communication. Experimental evaluations show that DGH-GC outperforms state-of-the-art methods, and that DGH-(GC)² further speeds up model convergence, reducing the average iteration time by about 26% compared with DGH-GC.
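To make the straggler-tolerance idea concrete, below is a minimal Python sketch of gradient coding in the spirit of the scheme this paper builds on (the cyclic-repetition construction of Tandon et al., "Gradient Coding", ICML 2017), plus a top-k sparsifier of the kind applied to up/downstream traffic. The encoding matrix, the decode and top_k_sparsify helpers, and all parameters are illustrative assumptions for a toy setting, not the paper's DGH-GC or DGH-(GC)² implementation.

```python
# Toy gradient coding: k = 3 data partitions, n = 3 workers, tolerate s = 1
# straggler. Each worker stores s + 1 = 2 partitions, so the full gradient is
# recoverable from any n - s = 2 workers. Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)

k, dim = 3, 4
partial_grads = rng.normal(size=(k, dim))   # g1, g2, g3, one per partition
full_grad = partial_grads.sum(axis=0)       # what the parameter server wants

# Encoding matrix B (n x k): row i is the linear combination worker i sends.
B = np.array([[0.5, 1.0,  0.0],   # worker 1 holds partitions {1, 2}
              [0.0, 1.0, -1.0],   # worker 2 holds partitions {2, 3}
              [0.5, 0.0,  1.0]])  # worker 3 holds partitions {1, 3}

coded = B @ partial_grads                   # what each worker transmits

def decode(coded, B, alive):
    """Recover the gradient sum from any n - s surviving workers.

    Solves a^T B[alive] = 1^T, so that a^T coded[alive] = sum_i g_i.
    """
    a, *_ = np.linalg.lstsq(B[alive].T, np.ones(B.shape[1]), rcond=None)
    return a @ coded[alive]

def top_k_sparsify(g, ratio=0.25):
    """Keep only the largest-magnitude entries: a common gradient-compression
    primitive of the kind DGH-(GC)² applies to upstream and downstream
    traffic (the paper's exact compressor may differ)."""
    k_keep = max(1, int(ratio * g.size))
    idx = np.argpartition(np.abs(g), -k_keep)[-k_keep:]
    sparse = np.zeros_like(g)
    sparse[idx] = g[idx]
    return sparse

# Worker 2 straggles; the server decodes from workers 1 and 3 alone.
recovered = decode(coded, B, alive=[0, 2])
assert np.allclose(recovered, full_grad)
print("full gradient recovered despite one straggler")
```

Note that this toy example fixes the data replication and encoding in advance; DGH-GC additionally balances stragglers across groups and adapts the encoding to each node's measured computation capacity, which the sketch omits.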
Pages: 22723-22736 (14 pages)