Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1]
Choi, Hyeonseong [1]
Lee, Jaehwan [1]
Kim, Jik-Soo [2]
Jei, Hyunseung [3]
Roh, Hongchan [3]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2020 / Vol. 23 / Issue 03
Funding
National Research Foundation, Singapore;
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning;
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach using parameter-server and all-reduce schemes in order to address potential performance degradation problems in running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism to maintain training accuracy for asynchronous data-parallel deep learning processing with enhanced collective communication capability based on MPI. We successfully implement our proposed framework on TensorFlow and perform extensive experiments in both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by decreasing I/O bottlenecks and effectively increasing resource utilization in the heterogeneous multi-GPU cluster.
Pages: 2287 - 2300
Page count: 14
Related papers
42 in total
  • [21] A Data-Loader Tunable Knob to Shorten GPU Idleness for Distributed Deep Learning
    Jia, Danlin
    Yuan, Geng
    Xie, Yiming
    Lin, Xue
    Mi, Ningfang
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2024, 21 (04)
  • [22] Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
    Awan, Ammar Ahmad
    Subramoni, Hari
    Chu, Ching-Hsiang
    Panda, Dhabaleswar K.
    EUROMPI 2018: PROCEEDINGS OF THE 25TH EUROPEAN MPI USERS' GROUP MEETING, 2018
  • [23] Distributed Deep Learning Framework based on Shared Memory for Fast Deep Neural Network Training
    Lim, Eun-Ji
    Ahn, Shin-Young
    Park, Yoo-Mi
    Choi, Wan
    2018 INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGY CONVERGENCE (ICTC), 2018, : 1239 - 1242
  • [24] Falcon: Towards Computation-Parallel Deep Learning in Heterogeneous Parameter Server
    Zhou, Qihua
    Wang, Kun
    Guo, Song
    Lu, Haodong
    Li, Li
    Guo, Minyi
    Sun, Yanfei
    2019 39TH IEEE INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2019), 2019, : 196 - 206
  • [25] Distributed Deep Learning for Multi-Label Chest Radiography Classification
    Monshi, Maram Mahmoud A.
    Poon, Josiah
    Chung, Vera
    PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 4, 2022, : 949 - 956
  • [26] Early Experiences of Noise-Sensitivity Performance Analysis of a Distributed Deep Learning Framework
    Rojas, Elvis
    Knobloch, Michael
    Daoud, Nour
    Meneses, Esteban
    Mohr, Bernd
    2022 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2022), 2022, : 516 - 522
  • [27] A Dynamic Sliding Window Based Tensor Communication Scheduling Framework for Distributed Deep Learning
    Gao, Yunqi
    Hu, Bing
    Mashhadi, Mahdi Boloursaz
    Wang, Wei
    Tafazolli, Rahim
    Debbah, Merouane
    IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING, 2025, 12 (02) : 1080 - 1095
  • [28] Multi-Switch Cooperative In-Network Aggregation for Distributed Deep Learning
    Su, Ming-Wei
    Li, Yuan-Yu
    Lin, Kate Ching-Ju
    IEEE CONFERENCE ON GLOBAL COMMUNICATIONS, GLOBECOM, 2023, : 4767 - 4772
  • [29] Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing
    Luo, Yizhou
    Wang, Qiang
    Shi, Shaohuai
    Lai, Jiaxin
    Qi, Shuhan
    Zhang, Jiajia
    Wang, Xuan
    2024 IEEE/ACM 32ND INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE, IWQOS, 2024
  • [30] Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters
    Liu, Kaiyang
    Wang, Jingrong
    Huang, Zhiming
    Pan, Jianping
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, 35 (06) : 874 - 888