Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments

Cited by: 1
Authors
Du, Haizhou [1 ]
Huang, Sheng [1 ]
Xiang, Qiao [2 ]
Affiliations
[1] Shanghai Univ Elect Power, Shanghai, Peoples R China
[2] Xiamen Univ, Xiamen, Peoples R China
Source
PROCEEDINGS OF THE 19TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2022 (CF 2022) | 2022
Keywords
Distributed Deep Learning; Local Update Adaptation; Load-Balance; Heterogeneous Environments
DOI
10.1145/3528416.3530246
CLC Number
TP301 [Theory and Methods]
Subject Classification Code
081202
Abstract
The synchronized Local-SGD (stochastic gradient descent) strategy has become popular in distributed deep learning (DML) because it effectively reduces the frequency of model communication while ensuring global model convergence. However, it performs poorly and leads to excessive training time in heterogeneous environments because of differences in workers' performance. In particular, in scenarios with unbalanced data, these differences between workers may aggravate low resource utilization and eventually produce stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from the heterogeneity of computing resources or do not fully address environment dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with Orchestra, a novel adaptive load-balancing framework. The main idea of Orchestra is to improve resource utilization by balancing the load across workers with respect to both worker performance and data-volume imbalance. In addition, one of Orchestra's strongest features is per-worker adaptation of the number of local updates at each epoch. To achieve this, we propose a distributed deep-reinforcement-learning-driven algorithm that lets each worker dynamically determine its number of local updates and its training data volume, subject to mini-batch time cost and resource constraints at each epoch. Our design significantly improves model convergence speed in DML compared with other state-of-the-art approaches.
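For readers unfamiliar with the mechanism the abstract builds on, below is a minimal NumPy sketch (not the paper's implementation) of one synchronized Local-SGD round with per-worker local-update counts, on a toy least-squares problem. All names (local_sgd_round, taus, shards) are illustrative; in Orchestra, the per-epoch choice of each worker's update count and data volume is made by the deep-RL scheduler, for which a fixed assignment stands in here.

import numpy as np

# Illustrative sketch only: one synchronized Local-SGD round in which
# worker i runs taus[i] local update steps before the models are averaged.
# Orchestra's deep-RL scheduler, not a fixed list, would choose taus each
# epoch from mini-batch time cost and resource constraints.
def local_sgd_round(global_w, shards, taus, lr=0.1):
    local_models = []
    for (X, y), tau in zip(shards, taus):
        w = global_w.copy()
        for _ in range(tau):                    # tau local update steps
            grad = X.T @ (X @ w - y) / len(y)   # full-batch least-squares gradient, for simplicity
            w -= lr * grad
        local_models.append(w)
    return np.mean(local_models, axis=0)        # synchronized model averaging

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
# Unbalanced data: worker 0 holds four times as much data as worker 1.
shards = []
for n in (800, 200):
    X = rng.normal(size=(n, 5))
    shards.append((X, X @ w_true))

w = np.zeros(5)
for epoch in range(20):
    taus = [2, 8]   # stand-in for the adaptive per-worker, per-epoch choice
    w = local_sgd_round(w, shards, taus)
print("distance to optimum:", np.linalg.norm(w - w_true))

The sketch only shows the degree of freedom Orchestra exploits: because each worker's tau can differ per round, a scheduler can assign more local steps to faster workers (or smaller shards) so that all workers reach the synchronization barrier at roughly the same wall-clock time, avoiding stragglers.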
Pages: 181-184 (4 pages)
Related Papers
50 items in total (items [21]-[30] shown)
  • [21] Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters
    Kim, HyungJun
    Song, Chunggeon
    Lee, HwaMin
    Yu, Heonchang
    2023 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS, ICCE, 2023
  • [22] Multivariate LSTM for Execution Time Prediction in HPC for Distributed Deep Learning Training
    Assali, Tasnim
    Trabelsi Ayoub, Zayneb
    Ouni, Sofiane
    2024 IEEE 27TH INTERNATIONAL SYMPOSIUM ON REAL-TIME DISTRIBUTED COMPUTING, ISORC 2024, 2024
  • [23] Performance enhancement in hybrid SDN using advanced deep learning with multi-objective optimization frameworks under heterogeneous environments
    Bishla, Deepak
    Kumar, Brijesh
    INTERNATIONAL JOURNAL OF COMMUNICATION SYSTEMS, 2025, 38 (03)
  • [24] Communication compression techniques in distributed deep learning: A survey
    Wang, Zeqin
    Wen, Ming
    Xu, Yuedong
    Zhou, Yipeng
    Wang, Jessie Hui
    Zhang, Liang
    JOURNAL OF SYSTEMS ARCHITECTURE, 2023, 142
  • [25] Impact of data set noise on distributed deep learning
    Qinghao G.
    Liguo S.
    Sunying H.
    JOURNAL OF CHINA UNIVERSITIES OF POSTS AND TELECOMMUNICATIONS, 2020, 27 (02): 37 - 45
  • [26] BigDL: A Distributed Deep Learning Framework for Big Data
    Dai, Jason
    Wang, Yiheng
    Qiu, Xin
    Ding, Ding
    Zhang, Yao
    Wang, Yanzhang
    Jia, Xianyan
    Zhang, Cherry
    Wan, Yan
    Li, Zhichao
    Wang, Jiao
    Huang, Shengsheng
    Wu, Zhongyuan
    Wang, Yang
    Yang, Yuhao
    She, Bowen
    Shi, Dongjie
    Lu, Qi
    Huang, Kai
    Song, Guoqiong
    PROCEEDINGS OF THE 2019 TENTH ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '19), 2019: 50 - 60
  • [27] Evaluation and Optimization of Gradient Compression for Distributed Deep Learning
    Zhang, Lin
    Zhang, Longteng
    Shi, Shaohuai
    Chu, Xiaowen
    Li, Bo
    2023 IEEE 43RD INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, ICDCS, 2023: 361 - 371
  • [28] File Access Patterns of Distributed Deep Learning Applications
    Parraga, Edixon
    Leon, Betzabeth
    Mendez, Sandra
    Rexachs, Dolores
    Luque, Emilio
    CLOUD COMPUTING, BIG DATA & EMERGING TOPICS, JCC-BD&ET 2022, 2022, 1634: 3 - 19
  • [29] GSP Distributed Deep Learning Used for the Monitoring System
    Pan, Zhongming
    Luo, Yigui
    Sha, Wei
    Xie, Yin
    2021 IEEE 6TH INTERNATIONAL CONFERENCE ON BIG DATA ANALYTICS (ICBDA 2021), 2021: 224 - 229
  • [30] Membership Mappings for Practical Secure Distributed Deep Learning
    Kumar, Mohit
    Zhang, Weiping
    Fischer, Lukas
    Freudenthaler, Bernhard
    IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2023, 31 (08): 2617 - 2631