Exploiting potential of deep neural networks by layer-wise fine-grained parallelism

Cited by: 5
Authors
Jiang, Wenbin [1 ]
Zhang, Yangsong [1 ]
Liu, Pai [1 ]
Peng, Jing [1 ]
Yang, Laurence T. [2 ,3 ]
Ye, Geyan [1 ]
Jin, Hai [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Natl Engn Res Ctr Big Data Technol & Syst, Sch Comp Sci & Technol, Serv Comp Technol & Syst Lab, Cluster & Grid Comp, Wuhan 430074, Hubei, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Cyber Phys Social Syst Lab, Wuhan 430074, Hubei, Peoples R China
[3] St Francis Xavier Univ, Dept Comp Sci, Antigonish, NS, Canada
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2020, Vol. 102
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; Fine-grained parallelism; CUDA stream;
DOI
10.1016/j.future.2019.07.054
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
Deep neural networks (DNNs) have become increasingly important for big data analysis. They usually rely on data parallelism or model parallelism for extreme-scale computing, but both approaches improve performance mainly through coarse-grained parallelization schemes, and neither fully exploits the parallelism that many-core systems (such as GPUs) offer for neural network models. Here, a new fine-grained parallelism strategy (named FiLayer) is presented, based on layer-wise parallelization. It has two components: inter-layer parallelism and intra-layer parallelism. Inter-layer parallelism processes several neighboring layers of a network model in a pipelined manner. For intra-layer parallelism, the operations in one layer are separated into several parts and processed concurrently. CUDA streams are used to implement both fine-grained parallelism methods. A mathematical analysis is presented of how the fragment number influences the performance of inter-layer parallelism, together with an analysis of how the CUDA stream number influences the performance of intra-layer parallelism. The proposed approach is implemented on top of Caffe. Representative datasets, including CIFAR100 and ImageNet, are used in the experiments. The evaluation results show that the approach helps Caffe achieve remarkable speedups, which is significant for big data analysis. (C) 2019 Elsevier B.V. All rights reserved.
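Note: the paper's FiLayer implementation is not reproduced in this record. As a rough sketch of the intra-layer mechanism the abstract describes, the minimal CUDA example below splits one layer's element-wise operation into several parts and issues each part on its own CUDA stream, so that the copies and kernels of different parts can overlap on the GPU. The kernel reluPart, the part count, and the buffer sizes are illustrative assumptions, not details taken from the paper.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical per-part kernel: ReLU over one slice of a layer's output.
    // (Illustrative only; FiLayer's real operations live inside Caffe's layers.)
    __global__ void reluPart(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] > 0.0f ? data[i] : 0.0f;
    }

    int main() {
        const int kParts = 4;          // number of intra-layer parts (assumed)
        const int kPartLen = 1 << 20;  // elements per part (assumed)

        float *hostBuf, *devBuf;
        cudaMallocHost(&hostBuf, kParts * kPartLen * sizeof(float));  // pinned memory, needed for async copies
        cudaMalloc(&devBuf, kParts * kPartLen * sizeof(float));
        for (int i = 0; i < kParts * kPartLen; ++i) hostBuf[i] = (i % 2) ? 1.0f : -1.0f;

        cudaStream_t streams[kParts];
        for (int s = 0; s < kParts; ++s) cudaStreamCreate(&streams[s]);

        // One stream per part: copy-in, compute, and copy-out of different
        // parts are queued independently and may overlap on the GPU.
        for (int s = 0; s < kParts; ++s) {
            float* h = hostBuf + s * kPartLen;
            float* d = devBuf + s * kPartLen;
            cudaMemcpyAsync(d, h, kPartLen * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
            reluPart<<<(kPartLen + 255) / 256, 256, 0, streams[s]>>>(d, kPartLen);
            cudaMemcpyAsync(h, d, kPartLen * sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
        }
        for (int s = 0; s < kParts; ++s) cudaStreamSynchronize(streams[s]);

        printf("first element after ReLU: %f\n", hostBuf[0]);  // expect 0.0 (input was -1.0)

        for (int s = 0; s < kParts; ++s) cudaStreamDestroy(streams[s]);
        cudaFree(devBuf);
        cudaFreeHost(hostBuf);
        return 0;
    }

The same stream mechanism would also support the inter-layer pipeline mentioned in the abstract: a mini-batch is split into fragments, and the work of neighboring layers on different fragments is issued on different streams so that it can proceed concurrently.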
Pages: 210-221
Page count: 12