Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1 ]
Choi, Hyeonseong [1 ]
Lee, Jaehwan [1 ]
Kim, Jik-Soo [2 ]
Jei, Hyunseung [3 ]
Roh, Hongchan [3 ]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2020, Vol. 23, No. 3
Funding
National Research Foundation of Singapore
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning;
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification (CLC)
TP [Automation and Computer Technology]
Discipline code
0812
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach using parameter-server and all-reduce schemes in order to address potential performance degradation problems in running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism to maintain training accuracy for asynchronous data-parallel deep learning processing, with enhanced collective communication capability based on MPI. We successfully implement our proposed framework on TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by reducing I/O bottlenecks and effectively increasing resource utilization in the heterogeneous multi-GPU cluster.
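The hybrid aggregation idea in the abstract — all-reduce within a (homogeneous) group of GPUs, with a parameter server asynchronously combining updates across heterogeneous groups — can be illustrated with a toy sketch. This is a minimal conceptual model only, not the authors' actual TensorFlow/MPI implementation; all names (`allreduce_average`, `ParameterServer`, `apply_async`) are hypothetical.

```python
# Toy model of hybrid aggregation (hypothetical names, not the paper's code).
# Within a homogeneous GPU group: all-reduce-style gradient averaging.
# Across heterogeneous groups: a parameter server applies each group's
# averaged gradient asynchronously, so fast groups never wait on slow ones.

def allreduce_average(grads):
    """All-reduce within one homogeneous group: element-wise mean of gradients."""
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

class ParameterServer:
    """Toy server that applies group-level updates as they arrive."""
    def __init__(self, weights, lr=0.1):
        self.weights = list(weights)
        self.lr = lr

    def apply_async(self, group_grad):
        # Each group pushes its averaged gradient when it finishes its
        # mini-batch; no barrier with slower (heterogeneous) groups.
        self.weights = [w - self.lr * g
                        for w, g in zip(self.weights, group_grad)]
        return self.weights

# Two heterogeneous groups, each holding per-GPU gradients for 3 weights.
fast_group = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]   # e.g. two newer GPUs
slow_group = [[0.5, 0.5, 0.5]]                     # e.g. one older GPU

ps = ParameterServer([0.0, 0.0, 0.0], lr=0.1)
ps.apply_async(allreduce_average(fast_group))  # fast group finishes first
ps.apply_async(allreduce_average(slow_group))  # slow group arrives later
print(ps.weights)  # final weights after both asynchronous updates
```

The design point this sketch mirrors is that all-reduce is bandwidth-efficient among identical devices, while the parameter-server layer absorbs the timing skew that heterogeneous hardware introduces.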
Pages: 2287-2300 (14 pages)
Related papers (42 total)
  • [1] Kim, Youngrang; Choi, Hyeonseong; Lee, Jaehwan; Kim, Jik-Soo; Jei, Hyunseung; Roh, Hongchan. Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster. Cluster Computing, 2020, 23: 2287-2300
  • [2] Kim, Youngrang; Choi, Hyeonseong; Lee, Jaehwan; Kim, Jik-Soo; Jei, Hyunseung; Roh, Hongchan. Efficient Large-scale Deep Learning Framework for Heterogeneous Multi-GPU Cluster. 2019 IEEE 4th International Workshops on Foundations and Applications of Self* Systems (FAS*W 2019), 2019: 176-181
  • [3] Zhan, Jun; Zhang, Jinghui. Pipe-torch: Pipeline-Based Distributed Deep Learning in a GPU Cluster with Heterogeneous Networking. 2019 Seventh International Conference on Advanced Cloud and Big Data (CBD), 2019: 55-60
  • [4] Tanaka, Kenji; Arikawa, Yuki; Ito, Tsuyoshi; Morita, Kazutaka; Nemoto, Naru; Miura, Fumiaki; Terada, Kazuhiko; Teramoto, Junji; Sakamoto, Takeshi. Communication-Efficient Distributed Deep Learning with GPU-FPGA Heterogeneous Computing. 2020 IEEE Symposium on High-Performance Interconnects (HOTI 2020), 2020: 43-46
  • [5] Yao, Feixiang; Zhang, Zhonghao; Ji, Zeyu; Liu, Bin; Gao, Haoyuan. LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster. Journal of Supercomputing, 2024, 80(09): 12247-12272
  • [6] Li, Qingping; Xu, Jingwei; Cao, Chun. Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness. The 12th Asia-Pacific Symposium on Internetware (Internetware 2020), 2021: 217-228
  • [7] Zhang, Jinghui; Zhan, Jun; Li, Jiange; Jin, Jiahui; Qian, Lei. Optimizing execution for pipelined-based distributed deep learning in a heterogeneously networked GPU cluster. Concurrency and Computation: Practice & Experience, 2020, 32(23)
  • [8] Jin, Yi; Cai, Jiawei; Xu, Jiawei; Huan, Yuxiang; Yan, Yulong; Huang, Bin; Guo, Yongliang; Zheng, Lirong; Zou, Zhuo. Self-aware distributed deep learning framework for heterogeneous IoT edge devices. Future Generation Computer Systems, 2021, 125: 908-920
  • [9] Ravikumar, Aswathy; Sriraman, Harini. Computationally Efficient Neural Rendering for Generator Adversarial Networks Using a Multi-GPU Cluster in a Cloud Environment. IEEE Access, 2023, 11: 45559-45571
  • [10] Kim, HyungJun; Song, Chunggeon; Lee, HwaMin; Yu, Heonchang. Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters. 2023 IEEE International Conference on Consumer Electronics (ICCE), 2023