Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments

Cited by: 1
Authors
Du, Haizhou [1 ]
Huang, Sheng [1 ]
Xiang, Qiao [2 ]
Affiliations
[1] Shanghai Univ Elect Power, Shanghai, Peoples R China
[2] Xiamen Univ, Xiamen, Peoples R China
Source
PROCEEDINGS OF THE 19TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2022 (CF 2022) | 2022
Keywords
Distributed Deep Learning; Local Update Adaptation; Load-Balance; Heterogeneous Environments
DOI
10.1145/3528416.3530246
Chinese Library Classification (CLC)
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
The synchronized Local-SGD (stochastic gradient descent) strategy has become increasingly popular in distributed deep learning (DML) because it effectively reduces the frequency of model communication while ensuring global model convergence. However, it performs poorly and leads to excessive training time in heterogeneous environments due to differences in workers' performance. In particular, in scenarios with unbalanced data, these differences between workers can aggravate the low utilization of resources and eventually produce stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from the heterogeneity of computing resources or do not fully address environment dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with a novel, adaptive load-balancing framework called Orchestra. The main idea of Orchestra is to improve resource utilization by balancing the load across workers with different performance and unbalanced data volumes. Additionally, one of Orchestra's strongest features is adapting the number of local updates per worker at each epoch. To achieve this, we propose a distributed deep reinforcement learning-driven algorithm that lets each worker dynamically determine its number of local updates and its training data volume, subject to mini-batch cost time and resource constraints at each epoch. Our design significantly improves the convergence speed of the model in DML compared with other state-of-the-art approaches.
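To make the load-balancing idea concrete, the sketch below shows a minimal, assumption-laden illustration of per-worker local-update adaptation: faster workers are assigned more local SGD steps per epoch so that all workers reach the synchronization barrier at roughly the same time. The names (Worker, minibatch_time, assign_local_updates) and the proportional heuristic are hypothetical; the paper's actual method determines these quantities with a distributed deep reinforcement learning policy rather than this simple rule.

```python
# Minimal sketch (not the authors' implementation): balance per-worker local
# update counts so heterogeneous workers finish each epoch at about the same
# time. Orchestra learns these decisions with distributed deep RL; a simple
# proportional heuristic stands in for that policy here.

from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    minibatch_time: float  # measured seconds per mini-batch (hypothetical profiling input)


def assign_local_updates(workers, epoch_budget_s, min_updates=1):
    """Give each worker as many local SGD updates as fit in the epoch budget.

    Faster workers receive more local updates (and hence process more data),
    which reduces idle time at the synchronization point.
    """
    plan = {}
    for w in workers:
        updates = max(min_updates, int(epoch_budget_s // w.minibatch_time))
        plan[w.name] = updates
    return plan


if __name__ == "__main__":
    cluster = [
        Worker("gpu-fast", minibatch_time=0.05),
        Worker("gpu-mid", minibatch_time=0.12),
        Worker("cpu-slow", minibatch_time=0.40),
    ]
    # With a 6-second per-epoch budget the fast GPU runs ~120 local updates
    # while the slow CPU runs ~15, so no worker waits long at the barrier.
    print(assign_local_updates(cluster, epoch_budget_s=6.0))
```

In Orchestra's setting, the epoch budget and per-worker decisions would additionally respect resource constraints and be updated every epoch as conditions change.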
Pages: 181 - 184
Number of pages: 4