SingleCaffe: An Efficient Framework for Deep Learning on a Single Node

Cited by: 2
Authors
Wang, Chenxu [1 ,2 ]
Shen, Yixian [2 ,3 ]
Jia, Jia [4 ]
Lu, Yutong [2 ,3 ]
Chen, Zhiguang [2 ,3 ]
Wang, Bo [5 ]
Affiliations
[1] Natl Univ Def Technol, Sch Comp Sci, Changsha 410073, Hunan, Peoples R China
[2] Natl Supercomp Ctr Guangzhou, Guangzhou 510006, Guangdong, Peoples R China
[3] Sun Yat Sen Univ, Sch Data & Comp Sci, Guangzhou 510006, Guangdong, Peoples R China
[4] Beijing Special Engn & Design Inst, Beijing 100028, Peoples R China
[5] State Key Lab Math Engn & Adv Comp, Zhengzhou 450002, Henan, Peoples R China
Source
IEEE ACCESS | 2018, Vol. 6
Funding
U.S. National Science Foundation; National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Deep learning; framework; single node; multiple threads; speed up; data parallelism; parameter server;
DOI
10.1109/ACCESS.2018.2879877
CLC Classification
TP [Automation technology, computer technology];
Discipline Code
0812;
Abstract
Deep learning (DL) is currently the most promising approach for complex applications such as computer vision and natural language processing, and it thrives on large neural networks and large datasets. However, larger models and larger datasets lengthen training times, which impedes research and development progress. High-performance, data-parallel hardware such as GPUs has therefore been widely adopted by DL frameworks such as Caffe, Torch, and TensorFlow. Most DL frameworks, however, cannot fully exploit this hardware, so their computational efficiency remains low. In this paper, we present SingleCaffe, a DL framework that makes full use of such hardware and improves the computational efficiency of training. SingleCaffe spawns multiple threads to accelerate training within a single node and applies data parallelism across these threads. During training, SingleCaffe designates one thread as the parameter-server thread and the remaining threads as worker threads; data and workloads are distributed across the worker threads, while the server thread maintains the globally shared parameters. The framework also manages memory allocation carefully to reduce memory overhead. Experimental results show that SingleCaffe substantially improves training efficiency, and its single-node performance can even match distributed training across a dozen nodes.
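The abstract describes a single-node parameter-server design: one thread maintains the globally shared parameters while worker threads train on data shards in parallel. As a rough illustration only (this is not the authors' code; the toy model, the message queue, and all identifiers are assumptions), a minimal C++11 sketch of that thread layout might look like the following, with workers pushing gradients to a server thread that applies SGD updates.

// Minimal sketch of single-node parameter-server data parallelism (illustrative, not SingleCaffe code).
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct GradMsg { std::vector<double> grad; };  // gradient produced by one worker for one step

int main() {
    const int kWorkers = 4;       // worker threads (data parallelism)
    const int kSteps   = 10;      // training steps per worker
    const int kDim     = 8;       // toy parameter dimensionality
    const double kLr   = 0.01;    // learning rate

    std::vector<double> params(kDim, 0.0);  // globally shared parameters, updated only by the server thread
    std::queue<GradMsg> inbox;              // gradients sent from workers to the server
    std::mutex mu;
    std::condition_variable cv;

    // Server thread: receive gradients and apply SGD updates to the shared parameters.
    std::thread server([&] {
        for (int received = 0; received < kWorkers * kSteps; ++received) {
            std::unique_lock<std::mutex> lk(mu);
            cv.wait(lk, [&] { return !inbox.empty(); });
            GradMsg msg = std::move(inbox.front());
            inbox.pop();
            for (int i = 0; i < kDim; ++i) params[i] -= kLr * msg.grad[i];
        }
    });

    // Worker threads: each trains on its own data shard and pushes gradients to the server.
    std::vector<std::thread> workers;
    for (int w = 0; w < kWorkers; ++w) {
        workers.emplace_back([&, w] {
            for (int s = 0; s < kSteps; ++s) {
                GradMsg msg;
                msg.grad.assign(kDim, 0.1 * (w + 1));  // stand-in for a forward/backward pass on shard w
                {
                    std::lock_guard<std::mutex> lk(mu);
                    inbox.push(std::move(msg));
                }
                cv.notify_one();
            }
        });
    }

    for (auto& t : workers) t.join();
    server.join();
    std::cout << "params[0] after training: " << params[0] << "\n";
    return 0;
}

In the actual framework the gradient step would be a real forward/backward pass and the update would follow the configured solver; the sketch only shows the thread roles and the worker-to-server synchronization that the abstract outlines.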
Pages: 69660-69671
Page count: 12