A Quick Survey on Large Scale Distributed Deep Learning Systems

Cited by: 0
Authors
Zhang, Zhaoning [1 ]
Yin, Lujia [1 ]
Peng, Yuxing [1 ]
Li, Dongsheng [1 ]
Affiliations
[1] Natl Univ Def Technol, Sci & Technol Parallel & Distributed Lab, Changsha, Hunan, Peoples R China
Source
2018 IEEE 24TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2018) | 2018
Keywords
Deep Learning; Distributed Systems; Large Scale
DOI
10.1109/ICPADS.2018.00142
CLC Classification
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Deep learning has been widely used in various fields and plays a major role in many applications. As it gradually penetrates more fields, the data volume of each application is growing tremendously, and so are the computational complexity and the number of model parameters. As an obvious consequence, training and inference are time consuming. For example, training a classic ResNet-50 classification model on the ImageNet dataset takes 14 days on an NVIDIA M40 GPU. Distributed acceleration is therefore a very useful way to dispatch the computation of training, and even inference, across many nodes in parallel and speed up the whole process. Facebook's work and UC Berkeley's work can train the ResNet-50 model within an hour and within minutes, respectively, using distributed deep learning algorithms and systems. Like other distributed accelerations, this makes it possible to shrink the training of large models on large datasets from weeks to minutes, which gives researchers and developers more room to explore and experiment. However, beyond acceleration, what other issues will a distributed deep learning system confront? Where is the upper limit of acceleration? Which applications will acceleration be used for? What are the price and cost of acceleration? In this paper, we take a simple and quick survey of distributed deep learning systems from the algorithm perspective, the distributed-system perspective, and the application perspective. We present several recent notable works and analyze the restrictions and prospects of distributed methods.
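The acceleration scheme the abstract alludes to (the Facebook and UC Berkeley ResNet-50 results) is synchronous data-parallel SGD: each worker computes a gradient on its own data shard, the gradients are averaged across workers (an all-reduce, or a parameter server), and every replica applies the identical update. Below is a minimal single-process sketch of that idea; the toy problem, function names, and learning rate are illustrative assumptions, not from the paper.

```python
# Minimal simulation of synchronous data-parallel SGD:
# each worker computes a gradient on its shard, an all-reduce
# averages them, and every replica applies the same update.

def local_gradient(weights, shard):
    # Toy least-squares gradient for the model y = w * x on one shard.
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
            for w in weights]

def all_reduce_mean(grads):
    # Element-wise average across workers -- the role a parameter
    # server or a ring all-reduce plays in real distributed systems.
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

def train_step(weights, shards, lr=0.02):
    # In a real system each gradient is computed on a separate node.
    grads = [local_gradient(weights, shard) for shard in shards]
    mean_grad = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, mean_grad)]

if __name__ == "__main__":
    # Data generated by y = 3 * x, split across 4 simulated workers.
    shards = [[(x, 3.0 * x)] for x in (1.0, 2.0, 3.0, 4.0)]
    w = [0.0]
    for _ in range(200):
        w = train_step(w, shards)
    print(round(w[0], 3))  # → 3.0 (replicas converge to the true weight)
```

Because every replica sees the same averaged gradient, the weights stay identical on all workers, which is exactly what makes large-batch synchronous training equivalent to single-node SGD with a bigger batch.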
Pages: 1052-1056
Page count: 5
Related Papers
50 items in total
  • [1] A Survey of Graph-Based Deep Learning for Anomaly Detection in Distributed Systems
    Pazho, Armin Danesh
    Noghre, Ghazal Alinezhad
    Purkayastha, Arnab A.
    Vempati, Jagannadh
    Martin, Otto
    Tabkhi, Hamed
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (01) : 1 - 20
  • [2] Large-Scale Deep Learning for Building Intelligent Computer Systems
    Dean, Jeff
    PROCEEDINGS OF THE NINTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM'16), 2016, : 1 - 1
  • [3] Straggler-Aware Gradient Aggregation for Large-Scale Distributed Deep Learning System
    Li, Yijun
    Huang, Jiawei
    Li, Zhaoyi
    Liu, Jingling
    Zhou, Shengwen
    Zhang, Tao
    Jiang, Wanchun
    Wang, Jianxin
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2024, 32 (06) : 4917 - 4930
  • [4] A Survey on Techniques for Improving the Energy Efficiency of Large-Scale Distributed Systems
    Orgerie, Anne-Cecile
    De Assuncao, Marcos Dias
    Lefevre, Laurent
    ACM COMPUTING SURVEYS, 2014, 46 (04)
  • [5] Designing Reconfigurable Large-Scale Deep Learning Systems Using Stochastic Computing
    Ren, Ao
    Li, Zhe
    Wang, Yanzhi
    Qiu, Qinru
    Yuan, Bo
    2016 IEEE INTERNATIONAL CONFERENCE ON REBOOTING COMPUTING (ICRC), 2016,
  • [6] Resilience in Large Scale Distributed Systems
    Matni, Nikolai
    Leong, Yoke Peng
    Wang, Yuh Shyang
    You, Seungil
    Horowitz, Matanya B.
    Doyle, John C.
    2014 CONFERENCE ON SYSTEMS ENGINEERING RESEARCH, 2014, 28 : 285 - 293
  • [7] Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey
    Giang Nguyen
    Stefan Dlugolinsky
    Martin Bobák
    Viet Tran
    Álvaro López García
    Ignacio Heredia
    Peter Malík
    Ladislav Hluchý
    Artificial Intelligence Review, 2019, 52 : 77 - 124
  • [9] Private and Secure Distributed Deep Learning: A Survey
    Allaart, Corinne
    Amiri, Saba
    Bal, Henri
    Belloum, Adam
    Gommans, Leon
    van Halteren, Aart
    Klous, Sander
    ACM COMPUTING SURVEYS, 2025, 57 (04)
  • [10] Survey on Network of Distributed Deep Learning Training
    Zhu H.
    Yuan G.
    Yao C.
    Tan G.
    Wang Z.
    Hu Z.
    Zhang X.
    An X.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2021, 58 (01): : 98 - 115