Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments

Cited by: 1
Authors
Du, Haizhou [1 ]
Huang, Sheng [1 ]
Xiang, Qiao [2 ]
Affiliations
[1] Shanghai Univ Elect Power, Shanghai, Peoples R China
[2] Xiamen Univ, Xiamen, Peoples R China
Source
PROCEEDINGS OF THE 19TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2022 (CF 2022) | 2022
Keywords
Distributed Deep Learning; Local Update Adaptation; Load-Balance; Heterogeneous Environments;
DOI
10.1145/3528416.3530246
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Subject classification code
081202
Abstract
The synchronized Local-SGD (stochastic gradient descent) strategy has become increasingly popular in distributed deep learning (DML) because it effectively reduces the frequency of model communication while still ensuring global model convergence. However, it performs poorly and leads to excessive training time in heterogeneous environments due to differences in workers' performance. In particular, in data-unbalanced scenarios, these differences between workers may aggravate low resource utilization and eventually produce stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from heterogeneity of computing resources or do not fully address environment dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with a novel, adaptive load-balancing framework called Orchestra. The main idea of Orchestra is to improve resource utilization by balancing the load across workers with different performance and unbalanced data volumes. Additionally, one of Orchestra's strongest features is adapting the number of local updates per worker at each epoch. To achieve this, we propose a distributed deep reinforcement learning-driven algorithm that lets each worker dynamically determine its number of local updates and its training data volume, subject to mini-batch cost time and resource constraints at each epoch. Our design significantly improves the convergence speed of the model in DML compared with other state-of-the-art approaches.
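The core mechanism described in the abstract (each worker adapts its number of local updates and training data volume per epoch, subject to mini-batch cost time) can be illustrated with a minimal sketch. This is not the Orchestra implementation: the paper drives the decision with a distributed deep reinforcement learning policy, which the sketch below replaces with a simple proportional heuristic; all names (Worker, plan_round, round_budget_s) are hypothetical.

```python
# Minimal, illustrative sketch of per-worker local-update adaptation for
# synchronized Local-SGD on a heterogeneous cluster. This is NOT the
# Orchestra algorithm: the paper uses a distributed deep-RL policy, which
# is replaced here by a proportional heuristic for illustration only.
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    minibatch_time_s: float   # measured wall-clock time per mini-batch
    samples_available: int    # local data volume held by this worker


def plan_round(workers, round_budget_s=10.0, batch_size=32):
    """Assign each worker a number of local updates and a data volume so
    that every worker finishes the round in roughly round_budget_s,
    keeping fast workers busy instead of idling behind stragglers."""
    plan = {}
    for w in workers:
        # Local updates this worker can fit into the round budget.
        local_updates = max(1, int(round_budget_s / w.minibatch_time_s))
        # Data volume actually consumed, capped by the worker's local data.
        samples_used = min(local_updates * batch_size, w.samples_available)
        plan[w.name] = {"local_updates": local_updates,
                        "samples_used": samples_used}
    return plan


if __name__ == "__main__":
    cluster = [
        Worker("gpu-fast", minibatch_time_s=0.05, samples_available=50_000),
        Worker("gpu-slow", minibatch_time_s=0.20, samples_available=50_000),
        Worker("cpu-node", minibatch_time_s=0.80, samples_available=20_000),
    ]
    for name, assignment in plan_round(cluster).items():
        print(name, assignment)
```

In this sketch, a slow worker is given fewer local updates per round than a fast one, so all workers synchronize at roughly the same time; the paper's RL agent additionally accounts for resource constraints and environment dynamics when making this decision.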
Pages: 181 - 184
Number of pages: 4
Related papers
50 records in total
  • [1] ASHL: An Adaptive Multi-Stage Distributed Deep Learning Training Scheme for Heterogeneous Environments
    Shen, Zhaoyan
    Tang, Qingxiang
    Zhou, Tianren
    Zhang, Yuhao
    Jia, Zhiping
    Yu, Dongxiao
    Zhang, Zhiyong
    Li, Bingzhe
    IEEE TRANSACTIONS ON COMPUTERS, 2024, 73 (01) : 30 - 43
  • [2] Communication Optimization Schemes for Accelerating Distributed Deep Learning Systems
    Lee, Jaehwan
    Choi, Hyeonseong
    Jeong, Hyeonwoo
    Noh, Baekhyeon
    Shin, Ji Sun
    APPLIED SCIENCES-BASEL, 2020, 10 (24): 1 - 15
  • [3] Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning
    Peng, Jing
    Shi, Shaohuai
    Li, Zihan
    Li, Bo
    53RD INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2024, 2024, : 148 - 157
  • [4] Straggler-Aware In-Network Aggregation for Accelerating Distributed Deep Learning
    Lee, Hochan
    Lee, Jaewook
    Kim, Heewon
    Pack, Sangheon
    IEEE TRANSACTIONS ON SERVICES COMPUTING, 2023, 16 (06) : 4198 - 4204
  • [5] Self-aware distributed deep learning framework for heterogeneous IoT edge devices
    Jin, Yi
    Cai, Jiawei
    Xu, Jiawei
    Huan, Yuxiang
    Yan, Yulong
    Huang, Bin
    Guo, Yongliang
    Zheng, Lirong
    Zou, Zhuo
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2021, 125 : 908 - 920
  • [6] Communication-Efficient Distributed Deep Learning with GPU-FPGA Heterogeneous Computing
    Tanaka, Kenji
    Arikawa, Yuki
    Ito, Tsuyoshi
    Morita, Kazutaka
    Nemoto, Naru
    Miura, Fumiaki
    Terada, Kazuhiko
    Teramoto, Junji
    Sakamoto, Takeshi
    2020 IEEE SYMPOSIUM ON HIGH-PERFORMANCE INTERCONNECTS (HOTI 2020), 2020, : 43 - 46
  • [7] Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster
    Youngrang Kim
    Hyeonseong Choi
    Jaehwan Lee
    Jik-Soo Kim
    Hyunseung Jei
    Hongchan Roh
    Cluster Computing, 2020, 23 : 2287 - 2300
  • [8] Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster
    Kim, Youngrang
    Choi, Hyeonseong
    Lee, Jaehwan
    Kim, Jik-Soo
    Jei, Hyunseung
    Roh, Hongchan
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2020, 23 (03): 2287 - 2300
  • [9] Infer-HiRes: Accelerating Inference for High-Resolution Images with Quantization and Distributed Deep Learning
    Gulhane, Radha
    Anthony, Quentin
    Shafi, Aamir
    Subramoni, Hari
    Panda, Dhabaleswar K.
    PRACTICE AND EXPERIENCE IN ADVANCED RESEARCH COMPUTING 2024, PEARC 2024, 2024,
  • [10] Generalized Likelihood Ratio Test for Distributed Targets in Heterogeneous Environments
    Shang, Xiuqin
    Song, Hongjun
    2010 IEEE 10TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING PROCEEDINGS (ICSP2010), VOLS I-III, 2010, : 2242 - 2245