Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1 ]
Choi, Hyeonseong [1 ]
Lee, Jaehwan [1 ]
Kim, Jik-Soo [2 ]
Jei, Hyunseung [3 ]
Roh, Hongchan [3 ]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2020, Vol. 23, No. 3
Funding
National Research Foundation of Singapore;
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning;
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification (CLC)
TP [Automation technology; Computer technology];
Discipline code
0812 ;
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach using parameter-server and all-reduce schemes in order to address potential performance degradation problems in running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism to maintain training accuracy for asynchronous data-parallel deep learning processing, with enhanced collective communication capability based on MPI. We successfully implement our proposed framework on TensorFlow and perform extensive experiments in both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by decreasing I/O bottlenecks and effectively increasing resource utilization in the heterogeneous multi-GPU cluster.
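To make the two aggregation patterns the abstract combines concrete, the sketch below contrasts parameter-server averaging with (the result of) a ring all-reduce, and models a hybrid of the two. This is an illustrative simulation only, not the authors' implementation: the function names and the grouping of "fast" and "slow" workers are assumptions, and real MPI collectives are replaced by plain Python list arithmetic.

```python
# Illustrative sketch (not the paper's code): gradient aggregation patterns
# for data-parallel training, with "workers" as lists of per-parameter grads.

def parameter_server_aggregate(worker_grads):
    """Centralized: a server averages gradients pushed by all workers."""
    n = len(worker_grads)
    return [sum(g) / n for g in zip(*worker_grads)]

def ring_all_reduce_aggregate(worker_grads):
    """Decentralized: every worker ends with the same averaged gradient.
    Only the final result of a ring all-reduce (sum, then scale) is modeled."""
    n = len(worker_grads)
    total = [sum(g) for g in zip(*worker_grads)]
    return [[x / n for x in total] for _ in range(n)]  # one copy per worker

def hybrid_aggregate(fast_group, slow_group):
    """Hybrid scheme in the spirit of the paper: homogeneous fast workers
    synchronize via all-reduce; their result is then combined with slower
    workers through a parameter-server-style average."""
    synced = ring_all_reduce_aggregate(fast_group)[0]
    return parameter_server_aggregate([synced] + slow_group)

grads_fast = [[1.0, 2.0], [3.0, 4.0]]   # two fast GPUs
grads_slow = [[5.0, 6.0]]               # one slower GPU
print(hybrid_aggregate(grads_fast, grads_slow))  # -> [3.5, 4.5]
```

The hybrid call first reduces the fast group to its mean gradient [2.0, 3.0], then averages that with the slow worker's [5.0, 6.0], giving [3.5, 4.5]; in the paper's setting this is what lets fast GPUs proceed synchronously while slower GPUs contribute asynchronously.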
Pages: 2287-2300 (14 pages)