Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1 ]
Choi, Hyeonseong [1 ]
Lee, Jaehwan [1 ]
Kim, Jik-Soo [2 ]
Jei, Hyunseung [3 ]
Roh, Hongchan [3 ]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2020, Vol. 23, No. 3
Funding
National Research Foundation of Singapore;
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning;
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification (CLC)
TP [Automation technology; Computer technology];
Discipline code
0812 ;
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach using parameter-server and all-reduce schemes in order to address potential performance degradation problems in running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism to maintain training accuracy for asynchronous data-parallel deep learning processing, with enhanced collective communication capability based on MPI. We successfully implement our proposed framework on TensorFlow and perform extensive experiments in both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by decreasing I/O bottlenecks and effectively increasing resource utilization in the heterogeneous multi-GPU cluster.
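To make the two aggregation patterns the abstract combines concrete, the sketch below contrasts parameter-server averaging with (the result of) a ring all-reduce, and models a hybrid of the two. This is an illustrative simulation only, not the authors' implementation: the function names and the grouping of "fast" and "slow" workers are assumptions, and real MPI collectives are replaced by plain Python list arithmetic.

```python
# Illustrative sketch (not the paper's code): gradient aggregation patterns
# for data-parallel training, with "workers" as lists of per-parameter grads.

def parameter_server_aggregate(worker_grads):
    """Centralized: a server averages gradients pushed by all workers."""
    n = len(worker_grads)
    return [sum(g) / n for g in zip(*worker_grads)]

def ring_all_reduce_aggregate(worker_grads):
    """Decentralized: every worker ends with the same averaged gradient.
    Only the final result of a ring all-reduce (sum, then scale) is modeled."""
    n = len(worker_grads)
    total = [sum(g) for g in zip(*worker_grads)]
    return [[x / n for x in total] for _ in range(n)]  # one copy per worker

def hybrid_aggregate(fast_group, slow_group):
    """Hybrid scheme in the spirit of the paper: homogeneous fast workers
    synchronize via all-reduce; their result is then combined with slower
    workers through a parameter-server-style average."""
    synced = ring_all_reduce_aggregate(fast_group)[0]
    return parameter_server_aggregate([synced] + slow_group)

grads_fast = [[1.0, 2.0], [3.0, 4.0]]   # two fast GPUs
grads_slow = [[5.0, 6.0]]               # one slower GPU
print(hybrid_aggregate(grads_fast, grads_slow))  # -> [3.5, 4.5]
```

The hybrid call first reduces the fast group to its mean gradient [2.0, 3.0], then averages that with the slow worker's [5.0, 6.0], giving [3.5, 4.5]; in the paper's setting this is what lets fast GPUs proceed synchronously while slower GPUs contribute asynchronously.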
Pages: 2287-2300 (14 pages)