A Quick Survey on Large Scale Distributed Deep Learning Systems

Cited: 0
Authors
Zhang, Zhaoning [1 ]
Yin, Lujia [1 ]
Peng, Yuxing [1 ]
Li, Dongsheng [1 ]
Affiliations
[1] Natl Univ Def Technol, Sci & Technol Parallel & Distributed Lab, Changsha, Hunan, Peoples R China
Source
2018 IEEE 24TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2018) | 2018
Keywords
Deep Learning; Distributed Systems; Large Scale
DOI
10.1109/ICPADS.2018.00142
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Deep learning has been widely adopted across many fields and plays a major role in each of them. As it penetrates these fields, the data volume of each application grows tremendously, and so do the computational complexity and the number of model parameters. As an obvious consequence, training and inference are time-consuming: for example, training a classic ResNet-50 classification model on the ImageNet dataset takes 14 days on a single NVIDIA M40 GPU. Distributed acceleration is therefore a very useful approach: it dispatches the computation of training, and even inference, across many nodes in parallel to accelerate the whole process. Work from Facebook and from UC Berkeley can train the ResNet-50 model within an hour and within minutes, respectively, using distributed deep learning algorithms and systems. Like other distributed accelerations, this makes it possible to shrink the training of large models on large datasets from weeks to minutes, giving researchers and developers more room to explore and search. Beyond acceleration, however, what other issues will a distributed deep learning system confront? Where is the upper limit of acceleration? Which applications will acceleration be used for? What is the price and cost of acceleration? In this paper, we take a brief survey of distributed deep learning systems from the algorithm perspective, the distributed-system perspective, and the application perspective. We present several recent excellent works and analyze the restrictions and prospects of distributed methods.
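The abstract's core idea, dispatching training computation across many nodes in parallel, is most commonly realized as synchronous data-parallel SGD: each worker computes a gradient on its own data shard, the gradients are averaged (an all-reduce in a real system), and every replica applies the identical update. The sketch below is a minimal, hypothetical illustration of that scheme on a toy one-parameter linear model; the function names and the sequential loop standing in for parallel workers are assumptions for clarity, not part of the surveyed systems.

```python
# Hypothetical sketch of synchronous data-parallel SGD: each worker
# computes a gradient on its shard, gradients are averaged, and all
# replicas apply the same update. A toy 1-D linear model y = w * x.

def local_gradient(w, shard):
    # Gradient of mean squared error on one worker's shard of (x, y) pairs.
    n = len(shard)
    return sum(2 * (w * x - y) * x for x, y in shard) / n

def allreduce_mean(grads):
    # Stand-in for an all-reduce collective: average per-worker gradients.
    return sum(grads) / len(grads)

def train(shards, w=0.0, lr=0.1, steps=50):
    for _ in range(steps):
        grads = [local_gradient(w, s) for s in shards]  # parallel in a real system
        w -= lr * allreduce_mean(grads)                 # identical update on every replica
    return w

# Toy data drawn from y = 3x, split across two "workers".
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
print(round(train(shards), 3))  # → 3.0
```

Because every replica sees the same averaged gradient, the result is mathematically equivalent to large-batch SGD on the combined data, which is exactly why the large-batch tuning tricks in the Facebook and UC Berkeley results matter.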
Pages: 1052-1056 (5 pages)