Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1 ]
Choi, Hyeonseong [1 ]
Lee, Jaehwan [1 ]
Kim, Jik-Soo [2 ]
Jei, Hyunseung [3 ]
Roh, Hongchan [3 ]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2020, Vol. 23, No. 3
Funding
National Research Foundation of Singapore;
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning;
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification (CLC)
TP [Automation and Computer Technology];
Discipline Code
0812;
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach using parameter-server and all-reduce schemes in order to address potential performance degradation problems in running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism to maintain training accuracy for asynchronous data-parallel deep learning processing, with enhanced collective communication capability based on MPI. We successfully implement our proposed framework on TensorFlow and perform extensive experiments in both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by decreasing I/O bottlenecks and effectively increasing resource utilization in the heterogeneous multi-GPU cluster.
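The hybrid aggregation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes workers are grouped by GPU type, averages gradients with an all-reduce-style reduction within each homogeneous group, and then combines the group averages in a parameter-server-style weighted merge across groups (weighted by each group's worker count). The function names `allreduce_average` and `hybrid_aggregate` are hypothetical.

```python
import numpy as np

def allreduce_average(grads):
    """Average a list of per-worker gradient arrays (stand-in for ring all-reduce)."""
    return np.mean(grads, axis=0)

def hybrid_aggregate(groups):
    """Combine gradients from heterogeneous worker groups into one global gradient.

    groups: {group_name: [per-worker gradient arrays]} where each group holds
    workers with the same GPU type.
    """
    total_workers = sum(len(workers) for workers in groups.values())
    global_grad = np.zeros_like(next(iter(groups.values()))[0])
    for workers in groups.values():
        # Intra-group all-reduce over homogeneous GPUs.
        group_avg = allreduce_average(workers)
        # Parameter-server-style weighted merge across groups.
        global_grad += group_avg * (len(workers) / total_workers)
    return global_grad

# Two fast GPUs and one slow GPU, each contributing one gradient vector.
groups = {
    "fast": [np.array([1.0, 2.0]), np.array([3.0, 4.0])],
    "slow": [np.array([5.0, 6.0])],
}
print(hybrid_aggregate(groups))  # -> [3. 4.], the mean over all three workers
```

Weighting by group size makes the two-level scheme equivalent to a flat average over all workers, so heterogeneity changes the communication pattern without biasing the aggregated gradient.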
Pages: 2287-2300
Page count: 14
Related Papers
42 in total
  • [11] ASHL: An Adaptive Multi-Stage Distributed Deep Learning Training Scheme for Heterogeneous Environments
    Shen, Zhaoyan
    Tang, Qingxiang
    Zhou, Tianren
    Zhang, Yuhao
    Jia, Zhiping
    Yu, Dongxiao
    Zhang, Zhiyong
    Li, Bingzhe
    IEEE TRANSACTIONS ON COMPUTERS, 2024, 73 (01) : 30 - 43
  • [12] BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster
    Yang, Eunju
    Kang, Dong-Ki
    Youn, Chan-Hyun
    JOURNAL OF SUPERCOMPUTING, 2020, 76 (01): : 47 - 67
  • [14] Towards a Scalable and Distributed Infrastructure for Deep Learning Applications
    Hasheminezhad, Bita
    Shirzad, Shahrzad
    Wu, Nanmiao
    Diehl, Patrick
    Schulz, Hannes
    Kaiser, Hartmut
    PROCEEDINGS OF 2020 IEEE/ACM 5TH WORKSHOP ON DEEP LEARNING ON SUPERCOMPUTERS (DLS 2020), 2020, : 20 - 30
  • [15] BigDL: A Distributed Deep Learning Framework for Big Data
    Dai, Jason
    Wang, Yiheng
    Qiu, Xin
    Ding, Ding
    Zhang, Yao
    Wang, Yanzhang
    Jia, Xianyan
    Zhang, Cherry
    Wan, Yan
    Li, Zhichao
    Wang, Jiao
    Huang, Shengsheng
    Wu, Zhongyuan
    Wang, Yang
    Yang, Yuhao
    She, Bowen
    Shi, Dongjie
    Lu, Qi
    Huang, Kai
    Song, Guoqiong
    PROCEEDINGS OF THE 2019 TENTH ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '19), 2019, : 50 - 60
  • [16] Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments
    Du, Haizhou
    Huang, Sheng
    Xiang, Qiao
    PROCEEDINGS OF THE 19TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2022 (CF 2022), 2022, : 181 - 184
  • [18] Instance segmentation on distributed deep learning big data cluster
    Elhmadany, Mohammed
    Elmadah, Islam
    Abdelmunim, Hossam E.
    JOURNAL OF BIG DATA, 2024, 11 (01)
  • [19] Decentralized Distributed Multi-institutional PET Image Segmentation Using a Federated Deep Learning Framework
    Shiri, Isaac
    Sadr, Alireza Vafaei
    Amini, Mehdi
    Salimi, Yazdan
    Sanaat, Amirhossein
    Akhavanallaf, Azadeh
    Razeghi, Behrooz
    Ferdowsi, Sohrab
    Saberi, Abdollah
    Arabi, Hossein
    Becker, Minerva
    Voloshynovskiy, Slava
    Gunduz, Deniz
    Rahmim, Arman
    Zaidi, Habib
    CLINICAL NUCLEAR MEDICINE, 2022, 47 (07) : 606 - 617
  • [20] A Data-Loader Tunable Knob to Shorten GPU Idleness for Distributed Deep Learning
    Jia, Danlin
    Yuan, Geng
    Xie, Yiming
    Lin, Xue
    Mi, Ningfang
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2024, 21 (04)