Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments

Cited by: 1
Authors
Du, Haizhou [1]
Huang, Sheng [1]
Xiang, Qiao [2]
Affiliations
[1] Shanghai Univ Elect Power, Shanghai, Peoples R China
[2] Xiamen Univ, Xiamen, Peoples R China
Source
PROCEEDINGS OF THE 19TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2022 (CF 2022), 2022
Keywords
Distributed Deep Learning; Local Update Adaptation; Load-Balance; Heterogeneous Environments
DOI
10.1145/3528416.3530246
CLC Classification Number
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
The synchronized Local-SGD (stochastic gradient descent) strategy has become increasingly popular in distributed deep learning (DML), since it effectively reduces the frequency of model communication while ensuring global model convergence. However, it performs poorly in heterogeneous environments, where differences in worker performance lead to excessive training time. In particular, in data-imbalanced scenarios these differences may aggravate low resource utilization and eventually produce stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from heterogeneity of computing resources or do not fully address environment dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with a novel adaptive load-balancing framework called Orchestra. The main idea of Orchestra is to improve resource utilization by balancing the load across workers according to their performance and the imbalance of their data volumes. One of Orchestra's strongest features is that it adapts the number of local updates per worker at each epoch. To achieve this, we propose a distributed deep reinforcement learning-driven algorithm that lets each worker dynamically determine its number of local updates and its training data volume, subject to mini-batch cost time and resource constraints at each epoch. Our design significantly improves model convergence speed in DML compared with other state-of-the-art approaches.
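To make the local update adaptation idea in the abstract concrete, the following Python sketch simulates one aspect of it: synchronized Local-SGD rounds in which each worker runs a different number of local updates before global averaging. This is a minimal toy under stated assumptions, not Orchestra's actual method: Orchestra selects per-worker update counts and data volumes with a distributed deep reinforcement learning policy, whereas the stand-in heuristic here simply gives faster workers proportionally more local updates. All names (local_sgd_round, adapt_local_updates, tau, speed, budget) are illustrative assumptions.

import numpy as np

# Illustrative sketch only: a toy synchronized Local-SGD round in which each
# worker runs a different number of local updates before the global average.
# Orchestra chooses these counts with a distributed deep-RL policy; a simple
# speed-proportional heuristic stands in for it here.

def local_sgd_round(global_w, workers, lr=0.1):
    # One synchronized round: local updates on each worker, then averaging.
    local_models = []
    for w in workers:
        wk = global_w.copy()
        for _ in range(w["tau"]):              # tau = per-worker local update count
            X, y = w["sample_batch"]()         # draw a mini-batch from local data
            grad = X.T @ (X @ wk - y) / len(y) # least-squares gradient
            wk -= lr * grad
        local_models.append(wk)
    # Synchronization barrier: all workers finish, then parameters are averaged.
    return np.mean(local_models, axis=0)

def adapt_local_updates(workers, budget):
    # Stand-in for Orchestra's adaptation: give faster workers more local
    # updates so all workers finish the round at roughly the same time.
    speeds = np.array([w["speed"] for w in workers], dtype=float)
    for w, s in zip(workers, speeds):
        w["tau"] = max(1, round(budget * s / speeds.sum()))

# Toy heterogeneous setup: three workers with different speeds and data sizes.
rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)

def make_worker(n, speed):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.01 * rng.normal(size=n)
    def sample_batch(batch=16):
        idx = rng.integers(0, n, size=batch)
        return X[idx], y[idx]
    return {"sample_batch": sample_batch, "speed": speed, "tau": 1}

workers = [make_worker(200, 1.0), make_worker(500, 2.0), make_worker(100, 0.5)]

w = np.zeros(d)
for epoch in range(20):
    adapt_local_updates(workers, budget=12)  # re-balance update counts each epoch
    w = local_sgd_round(w, workers)
print("distance to optimum:", np.linalg.norm(w - w_true))

The point the sketch captures is the synchronization barrier: a round costs as much time as its slowest worker, so equalizing per-worker round times by adapting tau is what reduces straggler-induced idle time.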
Pages: 181-184
Number of pages: 4
Related Papers (50 in total)
  • [41] Lightweight distributed deep learning on compressive measurements for internet of things
    Hu, Guiqiang
    Hu, Yong
    Wu, Tao
    Zhang, Yushu
    Yuan, Shuai
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2025, 139
  • [42] TensorExpress: In-Network Communication Scheduling for Distributed Deep Learning
    Kang, Minkoo
    Yang, Gyeongsik
    Yoo, Yeonho
    Yoo, Chuck
    2020 IEEE 13TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD 2020), 2020, : 25 - 27
  • [43] Communication Optimization Algorithms for Distributed Deep Learning Systems: A Survey
    Yu, Enda
    Dong, Dezun
    Liao, Xiangke
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (12) : 3294 - 3308
  • [44] A Systematic Review of Distributed Deep Learning Frameworks for Big Data
    Berloco, Francesco
    Bevilacqua, Vitoantonio
    Colucci, Simona
    INTELLIGENT COMPUTING METHODOLOGIES, PT III, 2022, 13395 : 242 - 256
  • [45] Scalable Malware Detection System Using Distributed Deep Learning
    Kumar, Manish
    CYBERNETICS AND SYSTEMS, 2023, 54 (05) : 619 - 647
  • [46] Near-Optimal Sparse Allreduce for Distributed Deep Learning
    Li, Shigang
    Hoefler, Torsten
    PPOPP'22: PROCEEDINGS OF THE 27TH ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, 2022, : 135 - 149
  • [47] AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning
    Khaleghian, Salman
    Ullah, Habib
    Johnsen, Einar Broch
    Andersen, Anders
    Marinoni, Andrea
    IEEE ACCESS, 2022, 10 : 84569 - 84578
  • [48] Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training
    Liu, Ting
    Miao, Tianhao
    Wu, Qinghua
    Li, Zhenyu
    He, Guangxin
    Wu, Jiaoren
    Zhang, Shengzhuo
    Yang, Xingwu
    Tyson, Gareth
    Xie, Gaogang
    PROCEEDINGS OF THE ACM WEB CONFERENCE 2022 (WWW'22), 2022, : 1764 - 1773
  • [49] Deep Learning for Distributed Optimization: Applications to Wireless Resource Management
    Lee, Hoon
    Lee, Sang Hyun
    Quek, Tony Q. S.
    IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, 2019, 37 (10) : 2251 - 2266
  • [50] Quantum distributed deep learning architectures: Models, discussions, and applications
    Kwak, Yunseok
    Yun, Won Joon
    Kim, Jae Pyoung
    Cho, Hyunhee
    Park, Jihong
    Choi, Minseok
    Jung, Soyi
    Kim, Joongheon
    ICT EXPRESS, 2023, 9 (03): : 486 - 491