A Massively Parallel Coprocessor for Convolutional Neural Networks

Cited by: 155
Authors
Sankaradas, Murugan [1 ]
Jakkula, Venkata [1 ]
Cadambi, Srihari [1 ]
Chakradhar, Srimat [1 ]
Durdanovic, Igor [1 ]
Cosatto, Eric [1 ]
Graf, Hans Peter [1 ]
Affiliations
[1] NEC Labs Amer Inc, Princeton, NJ USA
Source
2009 20TH IEEE INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS | 2009
DOI
10.1109/ASAP.2009.25
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline code
0812;
Abstract
We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), an important class of machine learning algorithms. The coprocessor's functional units, consisting of parallel 2D convolution primitives and programmable units that perform the sub-sampling and non-linear functions specific to CNNs, implement a "meta-operator" to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low-precision data and further increase the effective memory bandwidth by packing multiple words into every memory operation, and we leverage the algorithm's simple data access patterns to use off-chip memory as a scratchpad for intermediate data, which is critical for CNNs. A CNN is mapped to the coprocessor's hardware primitives along with instructions to transfer data between memory and the coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1 GB. The coprocessor prototype can process at a rate of 3.4 billion multiply-accumulates per second (GMACs) for CNN forward propagation, 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application, with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.
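The "meta-operator" described in the abstract (a 2D convolution followed by sub-sampling and a non-linearity) can be sketched in software. This is a minimal reference-semantics sketch, not the coprocessor's interface: the function names, the average-pooling choice, and the fixed-point `scale` are illustrative assumptions; the paper only states that low-precision data and CNN-specific sub-sampling/non-linear units are used.

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2D convolution: the coprocessor's core primitive."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1), dtype=np.int32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw].astype(np.int32) * k)
    return out

def subsample(x, s=2):
    """Average-pool by factor s (stands in for the sub-sampling unit)."""
    H, W = x.shape
    x = x[:H - H % s, :W - W % s]
    return x.reshape(H // s, s, W // s, s).mean(axis=(1, 3))

def cnn_layer(x, kernel, scale=2 ** 8):
    """One CNN layer as the 'meta-operator': convolve in low-precision
    integers, rescale, sub-sample, then apply a non-linearity.
    The fixed-point scale of 2**8 is a hypothetical choice."""
    y = conv2d_valid(x, kernel)            # integer multiply-accumulates
    y = subsample(y.astype(np.float64) / scale)
    return np.tanh(y)                      # bounded non-linear output
```

A full CNN would be compiled into a sequence of such layers, with the host issuing transfers of inputs, kernels, and intermediate scratchpad data between the DDR2 banks and the functional units.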
Pages: 53-60
Page count: 8