Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments

Cited by: 1
Authors
Du, Haizhou [1]
Huang, Sheng [1]
Xiang, Qiao [2]
Affiliations
[1] Shanghai Univ Elect Power, Shanghai, Peoples R China
[2] Xiamen Univ, Xiamen, Peoples R China
Source
PROCEEDINGS OF THE 19TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2022 (CF 2022), 2022
Keywords
Distributed Deep Learning; Local Update Adaptation; Load-Balance; Heterogeneous Environments
DOI
10.1145/3528416.3530246
CLC Classification Number
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
The synchronized Local-SGD (stochastic gradient descent) strategy has become increasingly popular in distributed deep learning (DML), since it effectively reduces the frequency of model communication while ensuring global model convergence. However, it performs poorly in heterogeneous environments, where differences in worker performance lead to excessive training time. In particular, in data-imbalanced scenarios these differences may aggravate low resource utilization and eventually produce stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from heterogeneity of computing resources or do not fully address environment dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with a novel adaptive load-balancing framework called Orchestra. The main idea of Orchestra is to improve resource utilization by balancing the load across workers according to their performance and the imbalance of their data volumes. One of Orchestra's strongest features is that it adapts the number of local updates per worker at each epoch. To achieve this, we propose a distributed deep reinforcement learning-driven algorithm that lets each worker dynamically determine its number of local updates and its training data volume, subject to mini-batch cost time and resource constraints at each epoch. Our design significantly improves model convergence speed in DML compared with other state-of-the-art approaches.
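To make the local update adaptation idea in the abstract concrete, the following Python sketch simulates one aspect of it: synchronized Local-SGD rounds in which each worker runs a different number of local updates before global averaging. This is a minimal toy under stated assumptions, not Orchestra's actual method: Orchestra selects per-worker update counts and data volumes with a distributed deep reinforcement learning policy, whereas the stand-in heuristic here simply gives faster workers proportionally more local updates. All names (local_sgd_round, adapt_local_updates, tau, speed, budget) are illustrative assumptions.

import numpy as np

# Illustrative sketch only: a toy synchronized Local-SGD round in which each
# worker runs a different number of local updates before the global average.
# Orchestra chooses these counts with a distributed deep-RL policy; a simple
# speed-proportional heuristic stands in for it here.

def local_sgd_round(global_w, workers, lr=0.1):
    # One synchronized round: local updates on each worker, then averaging.
    local_models = []
    for w in workers:
        wk = global_w.copy()
        for _ in range(w["tau"]):              # tau = per-worker local update count
            X, y = w["sample_batch"]()         # draw a mini-batch from local data
            grad = X.T @ (X @ wk - y) / len(y) # least-squares gradient
            wk -= lr * grad
        local_models.append(wk)
    # Synchronization barrier: all workers finish, then parameters are averaged.
    return np.mean(local_models, axis=0)

def adapt_local_updates(workers, budget):
    # Stand-in for Orchestra's adaptation: give faster workers more local
    # updates so all workers finish the round at roughly the same time.
    speeds = np.array([w["speed"] for w in workers], dtype=float)
    for w, s in zip(workers, speeds):
        w["tau"] = max(1, round(budget * s / speeds.sum()))

# Toy heterogeneous setup: three workers with different speeds and data sizes.
rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)

def make_worker(n, speed):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.01 * rng.normal(size=n)
    def sample_batch(batch=16):
        idx = rng.integers(0, n, size=batch)
        return X[idx], y[idx]
    return {"sample_batch": sample_batch, "speed": speed, "tau": 1}

workers = [make_worker(200, 1.0), make_worker(500, 2.0), make_worker(100, 0.5)]

w = np.zeros(d)
for epoch in range(20):
    adapt_local_updates(workers, budget=12)  # re-balance update counts each epoch
    w = local_sgd_round(w, workers)
print("distance to optimum:", np.linalg.norm(w - w_true))

The point the sketch captures is the synchronization barrier: a round costs as much time as its slowest worker, so equalizing per-worker round times by adapting tau is what reduces straggler-induced idle time.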
Pages: 181-184
Number of pages: 4
Related Papers (50 in total)
  • [41] Lightweight distributed deep learning on compressive measurements for internet of things
    Hu, Guiqiang
    Hu, Yong
    Wu, Tao
    Zhang, Yushu
    Yuan, Shuai
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2025, 139
  • [42] TensorExpress: In-Network Communication Scheduling for Distributed Deep Learning
    Kang, Minkoo
    Yang, Gyeongsik
    Yoo, Yeonho
    Yoo, Chuck
    2020 IEEE 13TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD 2020), 2020, : 25 - 27
  • [43] Communication Optimization Algorithms for Distributed Deep Learning Systems: A Survey
    Yu, Enda
    Dong, Dezun
    Liao, Xiangke
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (12) : 3294 - 3308
  • [44] A Systematic Review of Distributed Deep Learning Frameworks for Big Data
    Berloco, Francesco
    Bevilacqua, Vitoantonio
    Colucci, Simona
    INTELLIGENT COMPUTING METHODOLOGIES, PT III, 2022, 13395 : 242 - 256
  • [45] Scalable Malware Detection System Using Distributed Deep Learning
    Kumar, Manish
    CYBERNETICS AND SYSTEMS, 2023, 54 (05) : 619 - 647
  • [46] Near-Optimal Sparse Allreduce for Distributed Deep Learning
    Li, Shigang
    Hoefler, Torsten
    PPOPP'22: PROCEEDINGS OF THE 27TH ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, 2022, : 135 - 149
  • [47] AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning
    Khaleghian, Salman
    Ullah, Habib
    Johnsen, Einar Broch
    Andersen, Anders
    Marinoni, Andrea
    IEEE ACCESS, 2022, 10 : 84569 - 84578
  • [48] Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training
    Liu, Ting
    Miao, Tianhao
    Wu, Qinghua
    Li, Zhenyu
    He, Guangxin
    Wu, Jiaoren
    Zhang, Shengzhuo
    Yang, Xingwu
    Tyson, Gareth
    Xie, Gaogang
    PROCEEDINGS OF THE ACM WEB CONFERENCE 2022 (WWW'22), 2022, : 1764 - 1773
  • [49] Deep Learning for Distributed Optimization: Applications to Wireless Resource Management
    Lee, Hoon
    Lee, Sang Hyun
    Quek, Tony Q. S.
    IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, 2019, 37 (10) : 2251 - 2266
  • [50] Quantum distributed deep learning architectures: Models, discussions, and applications
    Kwak, Yunseok
    Yun, Won Joon
    Kim, Jae Pyoung
    Cho, Hyunhee
    Park, Jihong
    Choi, Minseok
    Jung, Soyi
    Kim, Joongheon
    ICT EXPRESS, 2023, 9 (03): : 486 - 491