A Quick Survey on Large Scale Distributed Deep Learning Systems

Cited by: 0
Authors
Zhang, Zhaoning [1 ]
Yin, Lujia [1 ]
Peng, Yuxing [1 ]
Li, Dongsheng [1 ]
Affiliations
[1] Natl Univ Def Technol, Sci & Technol Parallel & Distributed Lab, Changsha, Hunan, Peoples R China
Source
2018 IEEE 24TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2018) | 2018
Keywords
Deep Learning; Distributed Systems; Large Scale
DOI
10.1109/ICPADS.2018.00142
CLC Classification
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Deep learning has been widely used in various fields and plays a major role in many applications. As it gradually penetrates more fields, the data volume of each application is growing tremendously, and so are the computational complexity and the number of model parameters. As an obvious consequence, training and inference are time consuming. For example, training a classic ResNet-50 classification model on the ImageNet dataset takes 14 days on an NVIDIA M40 GPU. Distributed acceleration is therefore a very useful way to dispatch the computation of training, and even inference, across many nodes in parallel and speed up the whole process. Facebook's work and UC Berkeley's work can train the ResNet-50 model within an hour and within minutes, respectively, using distributed deep learning algorithms and systems. Like other distributed accelerations, this makes it possible to shrink the training of large models on large datasets from weeks to minutes, which gives researchers and developers more room to explore and experiment. However, beyond acceleration, what other issues will a distributed deep learning system confront? Where is the upper limit of acceleration? Which applications will acceleration be used for? What are the price and cost of acceleration? In this paper, we take a simple and quick survey of distributed deep learning systems from the algorithm perspective, the distributed-system perspective, and the application perspective. We present several recent notable works and analyze the restrictions and prospects of distributed methods.
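The acceleration scheme the abstract alludes to (the Facebook and UC Berkeley ResNet-50 results) is synchronous data-parallel SGD: each worker computes a gradient on its own data shard, the gradients are averaged across workers (an all-reduce, or a parameter server), and every replica applies the identical update. Below is a minimal single-process sketch of that idea; the toy problem, function names, and learning rate are illustrative assumptions, not from the paper.

```python
# Minimal simulation of synchronous data-parallel SGD:
# each worker computes a gradient on its shard, an all-reduce
# averages them, and every replica applies the same update.

def local_gradient(weights, shard):
    # Toy least-squares gradient for the model y = w * x on one shard.
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
            for w in weights]

def all_reduce_mean(grads):
    # Element-wise average across workers -- the role a parameter
    # server or a ring all-reduce plays in real distributed systems.
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

def train_step(weights, shards, lr=0.02):
    # In a real system each gradient is computed on a separate node.
    grads = [local_gradient(weights, shard) for shard in shards]
    mean_grad = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, mean_grad)]

if __name__ == "__main__":
    # Data generated by y = 3 * x, split across 4 simulated workers.
    shards = [[(x, 3.0 * x)] for x in (1.0, 2.0, 3.0, 4.0)]
    w = [0.0]
    for _ in range(200):
        w = train_step(w, shards)
    print(round(w[0], 3))  # → 3.0 (replicas converge to the true weight)
```

Because every replica sees the same averaged gradient, the weights stay identical on all workers, which is exactly what makes large-batch synchronous training equivalent to single-node SGD with a bigger batch.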
Pages: 1052-1056
Page count: 5
Related Papers
50 items in total
  • [1] A Survey of Graph-Based Deep Learning for Anomaly Detection in Distributed Systems
    Pazho, Armin Danesh
    Noghre, Ghazal Alinezhad
    Purkayastha, Arnab A.
    Vempati, Jagannadh
    Martin, Otto
    Tabkhi, Hamed
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (01) : 1 - 20
  • [2] Large-Scale Deep Learning for Building Intelligent Computer Systems
    Dean, Jeff
    PROCEEDINGS OF THE NINTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM'16), 2016, : 1 - 1
  • [3] Straggler-Aware Gradient Aggregation for Large-Scale Distributed Deep Learning System
    Li, Yijun
    Huang, Jiawei
    Li, Zhaoyi
    Liu, Jingling
    Zhou, Shengwen
    Zhang, Tao
    Jiang, Wanchun
    Wang, Jianxin
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2024, 32 (06) : 4917 - 4930
  • [4] A Survey on Techniques for Improving the Energy Efficiency of Large-Scale Distributed Systems
    Orgerie, Anne-Cecile
    De Assuncao, Marcos Dias
    Lefevre, Laurent
    ACM COMPUTING SURVEYS, 2014, 46 (04)
  • [5] Designing Reconfigurable Large-Scale Deep Learning Systems Using Stochastic Computing
    Ren, Ao
    Li, Zhe
    Wang, Yanzhi
    Qiu, Qinru
    Yuan, Bo
    2016 IEEE INTERNATIONAL CONFERENCE ON REBOOTING COMPUTING (ICRC), 2016,
  • [6] Resilience in Large Scale Distributed Systems
    Matni, Nikolai
    Leong, Yoke Peng
    Wang, Yuh Shyang
    You, Seungil
    Horowitz, Matanya B.
    Doyle, John C.
    2014 CONFERENCE ON SYSTEMS ENGINEERING RESEARCH, 2014, 28 : 285 - 293
  • [7] Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey
    Giang Nguyen
    Stefan Dlugolinsky
    Martin Bobák
    Viet Tran
    Álvaro López García
    Ignacio Heredia
    Peter Malík
    Ladislav Hluchý
    Artificial Intelligence Review, 2019, 52 : 77 - 124
  • [9] Private and Secure Distributed Deep Learning: A Survey
    Allaart, Corinne
    Amiri, Saba
    Bal, Henri
    Belloum, Adam
    Gommans, Leon
    van Halteren, Aart
    Klous, Sander
    ACM COMPUTING SURVEYS, 2025, 57 (04)
  • [10] Survey on Network of Distributed Deep Learning Training
    Zhu H.
    Yuan G.
    Yao C.
    Tan G.
    Wang Z.
    Hu Z.
    Zhang X.
    An X.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2021, 58 (01): : 98 - 115