Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1 ]
Choi, Hyeonseong [1 ]
Lee, Jaehwan [1 ]
Kim, Jik-Soo [2 ]
Jei, Hyunseung [3 ]
Roh, Hongchan [3 ]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2020, Vol. 23, No. 3
Funding
National Research Foundation of Singapore
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning;
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification (CLC)
TP [Automation and Computer Technology]
Discipline code
0812
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach using parameter-server and all-reduce schemes in order to address potential performance degradation problems in running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism to maintain training accuracy for asynchronous data-parallel deep learning processing, with enhanced collective communication capability based on MPI. We successfully implement our proposed framework on TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by reducing I/O bottlenecks and effectively increasing resource utilization in the heterogeneous multi-GPU cluster.
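The hybrid aggregation idea in the abstract — all-reduce within a (homogeneous) group of GPUs, with a parameter server asynchronously combining updates across heterogeneous groups — can be illustrated with a toy sketch. This is a minimal conceptual model only, not the authors' actual TensorFlow/MPI implementation; all names (`allreduce_average`, `ParameterServer`, `apply_async`) are hypothetical.

```python
# Toy model of hybrid aggregation (hypothetical names, not the paper's code).
# Within a homogeneous GPU group: all-reduce-style gradient averaging.
# Across heterogeneous groups: a parameter server applies each group's
# averaged gradient asynchronously, so fast groups never wait on slow ones.

def allreduce_average(grads):
    """All-reduce within one homogeneous group: element-wise mean of gradients."""
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

class ParameterServer:
    """Toy server that applies group-level updates as they arrive."""
    def __init__(self, weights, lr=0.1):
        self.weights = list(weights)
        self.lr = lr

    def apply_async(self, group_grad):
        # Each group pushes its averaged gradient when it finishes its
        # mini-batch; no barrier with slower (heterogeneous) groups.
        self.weights = [w - self.lr * g
                        for w, g in zip(self.weights, group_grad)]
        return self.weights

# Two heterogeneous groups, each holding per-GPU gradients for 3 weights.
fast_group = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]   # e.g. two newer GPUs
slow_group = [[0.5, 0.5, 0.5]]                     # e.g. one older GPU

ps = ParameterServer([0.0, 0.0, 0.0], lr=0.1)
ps.apply_async(allreduce_average(fast_group))  # fast group finishes first
ps.apply_async(allreduce_average(slow_group))  # slow group arrives later
print(ps.weights)  # final weights after both asynchronous updates
```

The design point this sketch mirrors is that all-reduce is bandwidth-efficient among identical devices, while the parameter-server layer absorbs the timing skew that heterogeneous hardware introduces.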
Pages: 2287-2300 (14 pages)
Related papers (42 total)
  • [1] Kim, Youngrang; Choi, Hyeonseong; Lee, Jaehwan; Kim, Jik-Soo; Jei, Hyunseung; Roh, Hongchan. Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster. Cluster Computing, 2020, 23: 2287-2300
  • [2] Kim, Youngrang; Choi, Hyeonseong; Lee, Jaehwan; Kim, Jik-Soo; Jei, Hyunseung; Roh, Hongchan. Efficient Large-scale Deep Learning Framework for Heterogeneous Multi-GPU Cluster. 2019 IEEE 4th International Workshops on Foundations and Applications of Self* Systems (FAS*W 2019), 2019: 176-181
  • [3] Zhan, Jun; Zhang, Jinghui. Pipe-torch: Pipeline-Based Distributed Deep Learning in a GPU Cluster with Heterogeneous Networking. 2019 Seventh International Conference on Advanced Cloud and Big Data (CBD), 2019: 55-60
  • [4] Tanaka, Kenji; Arikawa, Yuki; Ito, Tsuyoshi; Morita, Kazutaka; Nemoto, Naru; Miura, Fumiaki; Terada, Kazuhiko; Teramoto, Junji; Sakamoto, Takeshi. Communication-Efficient Distributed Deep Learning with GPU-FPGA Heterogeneous Computing. 2020 IEEE Symposium on High-Performance Interconnects (HOTI 2020), 2020: 43-46
  • [5] Yao, Feixiang; Zhang, Zhonghao; Ji, Zeyu; Liu, Bin; Gao, Haoyuan. LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster. Journal of Supercomputing, 2024, 80(09): 12247-12272
  • [6] Li, Qingping; Xu, Jingwei; Cao, Chun. Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness. The 12th Asia-Pacific Symposium on Internetware (Internetware 2020), 2021: 217-228
  • [7] Zhang, Jinghui; Zhan, Jun; Li, Jiange; Jin, Jiahui; Qian, Lei. Optimizing execution for pipelined-based distributed deep learning in a heterogeneously networked GPU cluster. Concurrency and Computation: Practice & Experience, 2020, 32(23)
  • [8] Jin, Yi; Cai, Jiawei; Xu, Jiawei; Huan, Yuxiang; Yan, Yulong; Huang, Bin; Guo, Yongliang; Zheng, Lirong; Zou, Zhuo. Self-aware distributed deep learning framework for heterogeneous IoT edge devices. Future Generation Computer Systems, 2021, 125: 908-920
  • [9] Ravikumar, Aswathy; Sriraman, Harini. Computationally Efficient Neural Rendering for Generator Adversarial Networks Using a Multi-GPU Cluster in a Cloud Environment. IEEE Access, 2023, 11: 45559-45571
  • [10] Kim, HyungJun; Song, Chunggeon; Lee, HwaMin; Yu, Heonchang. Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters. 2023 IEEE International Conference on Consumer Electronics (ICCE), 2023