Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 18
Authors
Kim, Youngrang [1]
Choi, Hyeonseong [1]
Lee, Jaehwan [1]
Kim, Jik-Soo [2]
Jei, Hyunseung [3]
Roh, Hongchan [3]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2020, Vol. 23, No. 3
Funding
National Research Foundation of Singapore;
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning;
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach that combines parameter-server and all-reduce schemes to address potential performance degradation when running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism that maintains training accuracy for asynchronous data-parallel deep learning, with enhanced collective communication capability based on MPI. We implement the proposed framework on TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that the proposed framework improves computing performance by reducing I/O bottlenecks and effectively increases resource utilization in the heterogeneous multi-GPU cluster.
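The abstract does not give implementation details, so the following is only a toy, single-process sketch of what a hybrid of all-reduce and parameter-server aggregation could look like. All names (`all_reduce_mean`, `ParameterServer`, `hybrid_step`) are hypothetical, and the grouping of workers by GPU speed is an assumption made for illustration, not the authors' stated design. Gradients are averaged inside each group, and each group then pushes its result to a shared server without waiting for the other groups.

```python
# Toy illustration of hybrid aggregation; not the paper's implementation.


def all_reduce_mean(grads):
    """Average gradients element-wise across the workers of one group."""
    n = len(grads)
    return [sum(vals) / n for vals in zip(*grads)]


class ParameterServer:
    """Holds the global weights and applies group updates as they arrive."""

    def __init__(self, weights, lr):
        self.weights = list(weights)
        self.lr = lr

    def push(self, grad):
        # Asynchronous in spirit: a group pushes whenever it finishes,
        # without waiting for the other groups.
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grad)]

    def pull(self):
        return list(self.weights)


def hybrid_step(server, group_grads):
    """One group: all-reduce inside the group, then one push to the server."""
    server.push(all_reduce_mean(group_grads))
    return server.pull()


server = ParameterServer(weights=[1.0, 2.0], lr=0.5)
fast_group = [[0.25, 0.5], [0.75, 0.0]]  # assumed: two workers with similar GPUs
slow_group = [[1.0, 1.0]]                # assumed: a single slower GPU
hybrid_step(server, fast_group)
final_weights = hybrid_step(server, slow_group)
```

In a real cluster the in-group averaging would typically be done by a collective such as an MPI all-reduce, and the pushes would arrive from separate processes; this sketch only shows the order of the two aggregation stages.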
Pages: 2287-2300
Page count: 14
Related papers
43 in total
[21]   Performance Comparision of TPU, GPU, CPU on Google Colaboratory over Distributed Deep Learning [J].
Kimm, Haklin ;
Paik, Incheon ;
Kimm, Hanke .
2021 IEEE 14TH INTERNATIONAL SYMPOSIUM ON EMBEDDED MULTICORE/MANY-CORE SYSTEMS-ON-CHIP (MCSOC 2021), 2021, :312-319
[22]   Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL? [J].
Awan, Ammar Ahmad ;
Subramoni, Hari ;
Chu, Ching-Hsiang ;
Panda, Dhabaleswar K. .
EUROMPI 2018: PROCEEDINGS OF THE 25TH EUROPEAN MPI USERS' GROUP MEETING, 2018,
[23]   Distributed Deep Learning Framework based on Shared Memory for Fast Deep Neural Network Training [J].
Lim, Eun-Ji ;
Ahn, Shin-Young ;
Park, Yoo-Mi ;
Choi, Wan .
2018 INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGY CONVERGENCE (ICTC), 2018, :1239-1242
[24]   Falcon: Towards Computation-Parallel Deep Learning in Heterogeneous Parameter Server [J].
Zhou, Qihua ;
Wang, Kun ;
Guo, Song ;
Lu, Haodong ;
Li, Li ;
Guo, Minyi ;
Sun, Yanfei .
2019 39TH IEEE INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2019), 2019, :196-206
[25]   Distributed Deep Learning for Multi-Label Chest Radiography Classification [J].
Monshi, Maram Mahmoud A. ;
Poon, Josiah ;
Chung, Vera .
PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 4, 2022, :949-956
[26]   Multi-Switch Cooperative In-Network Aggregation for Distributed Deep Learning [J].
Su, Ming-Wei ;
Li, Yuan-Yu ;
Lin, Kate Ching-Ju .
IEEE CONFERENCE ON GLOBAL COMMUNICATIONS, GLOBECOM, 2023, :4767-4772
[27]   Early Experiences of Noise-Sensitivity Performance Analysis of a Distributed Deep Learning Framework [J].
Rojas, Elvis ;
Knobloch, Michael ;
Daoud, Nour ;
Meneses, Esteban ;
Mohr, Bernd .
2022 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2022), 2022, :516-522
[28]   A Dynamic Sliding Window Based Tensor Communication Scheduling Framework for Distributed Deep Learning [J].
Gao, Yunqi ;
Hu, Bing ;
Mashhadi, Mahdi Boloursaz ;
Wang, Wei ;
Tafazolli, Rahim ;
Debbah, Merouane .
IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING, 2025, 12 (02) :1080-1095
[29]   Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing [J].
Luo, Yizhou ;
Wang, Qiang ;
Shi, Shaohuai ;
Lai, Jiaxin ;
Qi, Shuhan ;
Zhang, Jiajia ;
Wang, Xuan .
2024 IEEE/ACM 32ND INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE, IWQOS, 2024,
[30]   Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters [J].
Liu, Kaiyang ;
Wang, Jingrong ;
Huang, Zhiming ;
Pan, Jianping .
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, 35 (06) :874-888