Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 18
Authors
Kim, Youngrang [1]
Choi, Hyeonseong [1]
Lee, Jaehwan [1]
Kim, Jik-Soo [2]
Jei, Hyunseung [3]
Roh, Hongchan [3]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2020 / Vol. 23 / Issue 03
Funding
National Research Foundation, Singapore;
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning;
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach that combines parameter-server and all-reduce schemes to address potential performance degradation when running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism that maintains training accuracy for asynchronous data-parallel deep learning, with enhanced collective communication capability based on MPI. We implement the proposed framework on TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that the proposed framework improves computing performance by reducing I/O bottlenecks and effectively increases resource utilization in the heterogeneous multi-GPU cluster.
Pages: 2287-2300
Page count: 14
Related papers
43 in total
[11]   ASHL: An Adaptive Multi-Stage Distributed Deep Learning Training Scheme for Heterogeneous Environments [J].
Shen, Zhaoyan ;
Tang, Qingxiang ;
Zhou, Tianren ;
Zhang, Yuhao ;
Jia, Zhiping ;
Yu, Dongxiao ;
Zhang, Zhiyong ;
Li, Bingzhe .
IEEE TRANSACTIONS ON COMPUTERS, 2024, 73 (01) :30-43
[13]   BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster [J].
Yang, Eunju ;
Kang, Dong-Ki ;
Youn, Chan-Hyun .
JOURNAL OF SUPERCOMPUTING, 2020, 76 (01) :47-67
[14]   Towards a Scalable and Distributed Infrastructure for Deep Learning Applications [J].
Hasheminezhad, Bita ;
Shirzad, Shahrzad ;
Wu, Nanmiao ;
Diehl, Patrick ;
Schulz, Hannes ;
Kaiser, Hartmut .
PROCEEDINGS OF 2020 IEEE/ACM 5TH WORKSHOP ON DEEP LEARNING ON SUPERCOMPUTERS (DLS 2020), 2020, :20-30
[15]   BigDL: A Distributed Deep Learning Framework for Big Data [J].
Dai, Jason ;
Wang, Yiheng ;
Qiu, Xin ;
Ding, Ding ;
Zhang, Yao ;
Wang, Yanzhang ;
Jia, Xianyan ;
Zhang, Cherry ;
Wan, Yan ;
Li, Zhichao ;
Wang, Jiao ;
Huang, Shengsheng ;
Wu, Zhongyuan ;
Wang, Yang ;
Yang, Yuhao ;
She, Bowen ;
Shi, Dongjie ;
Lu, Qi ;
Huang, Kai ;
Song, Guoqiong .
PROCEEDINGS OF THE 2019 TENTH ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '19), 2019, :50-60
[16]   Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments [J].
Du, Haizhou ;
Huang, Sheng ;
Xiang, Qiao .
PROCEEDINGS OF THE 19TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2022 (CF 2022), 2022, :181-184
[18]   Instance segmentation on distributed deep learning big data cluster [J].
Elhmadany, Mohammed ;
Elmadah, Islam ;
Abdelmunim, Hossam E. .
JOURNAL OF BIG DATA, 2024, 11 (01)
[19]   Decentralized Distributed Multi-institutional PET Image Segmentation Using a Federated Deep Learning Framework [J].
Shiri, Isaac ;
Sadr, Alireza Vafaei ;
Amini, Mehdi ;
Salimi, Yazdan ;
Sanaat, Amirhossein ;
Akhavanallaf, Azadeh ;
Razeghi, Behrooz ;
Ferdowsi, Sohrab ;
Saberi, Abdollah ;
Arabi, Hossein ;
Becker, Minerva ;
Voloshynovskiy, Slava ;
Gunduz, Deniz ;
Rahmim, Arman ;
Zaidi, Habib .
CLINICAL NUCLEAR MEDICINE, 2022, 47 (07) :606-617
[20]   A Data-Loader Tunable Knob to Shorten GPU Idleness for Distributed Deep Learning [J].
Jia, Danlin ;
Yuan, Geng ;
Xie, Yiming ;
Lin, Xue ;
Mi, Ningfang .
ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2024, 21 (04)