Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments

Cited by: 1
Authors
Du, Haizhou [1 ]
Huang, Sheng [1 ]
Xiang, Qiao [2 ]
Affiliations
[1] Shanghai Univ Elect Power, Shanghai, Peoples R China
[2] Xiamen Univ, Xiamen, Peoples R China
Source
PROCEEDINGS OF THE 19TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2022 (CF 2022) | 2022
Keywords
Distributed Deep Learning; Local Update Adaptation; Load-Balance; Heterogeneous Environments
DOI
10.1145/3528416.3530246
Chinese Library Classification (CLC)
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
The synchronized Local-SGD (stochastic gradient descent) strategy has become increasingly popular in distributed deep learning (DML) because it effectively reduces the frequency of model communication while ensuring global model convergence. However, it performs poorly and leads to excessive training time in heterogeneous environments due to differences in workers' performance. In particular, in scenarios with unbalanced data, these differences between workers can aggravate the low utilization of resources and eventually produce stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from the heterogeneity of computing resources or do not fully address environment dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with a novel, adaptive load-balancing framework called Orchestra. The main idea of Orchestra is to improve resource utilization by balancing the load across workers with different performance and unbalanced data volumes. Additionally, one of Orchestra's strongest features is adapting the number of local updates per worker at each epoch. To achieve this, we propose a distributed deep reinforcement learning-driven algorithm that lets each worker dynamically determine its number of local updates and its training data volume, subject to mini-batch cost time and resource constraints at each epoch. Our design significantly improves the convergence speed of the model in DML compared with other state-of-the-art approaches.
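To make the load-balancing idea concrete, the sketch below shows a minimal, assumption-laden illustration of per-worker local-update adaptation: faster workers are assigned more local SGD steps per epoch so that all workers reach the synchronization barrier at roughly the same time. The names (Worker, minibatch_time, assign_local_updates) and the proportional heuristic are hypothetical; the paper's actual method determines these quantities with a distributed deep reinforcement learning policy rather than this simple rule.

```python
# Minimal sketch (not the authors' implementation): balance per-worker local
# update counts so heterogeneous workers finish each epoch at about the same
# time. Orchestra learns these decisions with distributed deep RL; a simple
# proportional heuristic stands in for that policy here.

from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    minibatch_time: float  # measured seconds per mini-batch (hypothetical profiling input)


def assign_local_updates(workers, epoch_budget_s, min_updates=1):
    """Give each worker as many local SGD updates as fit in the epoch budget.

    Faster workers receive more local updates (and hence process more data),
    which reduces idle time at the synchronization point.
    """
    plan = {}
    for w in workers:
        updates = max(min_updates, int(epoch_budget_s // w.minibatch_time))
        plan[w.name] = updates
    return plan


if __name__ == "__main__":
    cluster = [
        Worker("gpu-fast", minibatch_time=0.05),
        Worker("gpu-mid", minibatch_time=0.12),
        Worker("cpu-slow", minibatch_time=0.40),
    ]
    # With a 6-second per-epoch budget the fast GPU runs ~120 local updates
    # while the slow CPU runs ~15, so no worker waits long at the barrier.
    print(assign_local_updates(cluster, epoch_budget_s=6.0))
```

In Orchestra's setting, the epoch budget and per-worker decisions would additionally respect resource constraints and be updated every epoch as conditions change.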
Pages: 181 - 184
Number of pages: 4