A Quick Survey on Large Scale Distributed Deep Learning Systems

Cited: 0
Authors
Zhang, Zhaoning [1 ]
Yin, Lujia [1 ]
Peng, Yuxing [1 ]
Li, Dongsheng [1 ]
Affiliations
[1] Natl Univ Def Technol, Sci & Technol Parallel & Distributed Lab, Changsha, Hunan, Peoples R China
Source
2018 IEEE 24TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2018) | 2018
Keywords
Deep Learning; Distributed Systems; Large Scale
DOI
10.1109/ICPADS.2018.00142
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Deep learning has been widely adopted across many fields and plays a major role in each of them. As it penetrates these fields, the data volume of each application grows tremendously, and so do the computational complexity and the number of model parameters. As an obvious consequence, training and inference are time-consuming: for example, training a classic ResNet-50 classification model on the ImageNet dataset takes 14 days on a single NVIDIA M40 GPU. Distributed acceleration is therefore a very useful approach: it dispatches the computation of training, and even inference, across many nodes in parallel to accelerate the whole process. Work from Facebook and from UC Berkeley can train the ResNet-50 model within an hour and within minutes, respectively, using distributed deep learning algorithms and systems. Like other distributed accelerations, this makes it possible to shrink the training of large models on large datasets from weeks to minutes, giving researchers and developers more room to explore and search. Beyond acceleration, however, what other issues will a distributed deep learning system confront? Where is the upper limit of acceleration? Which applications will acceleration be used for? What is the price and cost of acceleration? In this paper, we take a brief survey of distributed deep learning systems from the algorithm perspective, the distributed-system perspective, and the application perspective. We present several recent excellent works and analyze the restrictions and prospects of distributed methods.
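The abstract's core idea, dispatching training computation across many nodes in parallel, is most commonly realized as synchronous data-parallel SGD: each worker computes a gradient on its own data shard, the gradients are averaged (an all-reduce in a real system), and every replica applies the identical update. The sketch below is a minimal, hypothetical illustration of that scheme on a toy one-parameter linear model; the function names and the sequential loop standing in for parallel workers are assumptions for clarity, not part of the surveyed systems.

```python
# Hypothetical sketch of synchronous data-parallel SGD: each worker
# computes a gradient on its shard, gradients are averaged, and all
# replicas apply the same update. A toy 1-D linear model y = w * x.

def local_gradient(w, shard):
    # Gradient of mean squared error on one worker's shard of (x, y) pairs.
    n = len(shard)
    return sum(2 * (w * x - y) * x for x, y in shard) / n

def allreduce_mean(grads):
    # Stand-in for an all-reduce collective: average per-worker gradients.
    return sum(grads) / len(grads)

def train(shards, w=0.0, lr=0.1, steps=50):
    for _ in range(steps):
        grads = [local_gradient(w, s) for s in shards]  # parallel in a real system
        w -= lr * allreduce_mean(grads)                 # identical update on every replica
    return w

# Toy data drawn from y = 3x, split across two "workers".
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
print(round(train(shards), 3))  # → 3.0
```

Because every replica sees the same averaged gradient, the result is mathematically equivalent to large-batch SGD on the combined data, which is exactly why the large-batch tuning tricks in the Facebook and UC Berkeley results matter.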
Pages: 1052-1056 (5 pages)