On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems

Cited by: 60
Authors
Choi, Wonje [1 ]
Duraisamy, Karthi [1 ]
Kim, Ryan Gary [2 ]
Doppa, Janardhan Rao [1 ]
Pande, Partha Pratim [1 ]
Marculescu, Diana [2 ]
Marculescu, Radu [2 ]
Affiliations
[1] Washington State Univ, Elect Engn & Comp Sci, Pullman, WA 99164 USA
[2] Carnegie Mellon Univ, ECE, Pittsburgh, PA 15213 USA
Funding
US National Science Foundation;
Keywords
System-on-Chip; Deep Learning; Manycore Systems; Wireless Communication; Energy-Efficient Computing; Heterogeneous Architectures; Network-on-Chip; Design Space Exploration; Integrated Antennas; Neural Networks; Optimization; NoC; Interconnection;
DOI
10.1109/TC.2017.2777863
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Convolutional Neural Networks (CNNs) have shown a great deal of success in diverse application domains, including computer vision, speech recognition, and natural language processing. However, as the size of datasets and the depth of neural network architectures continue to grow, it is imperative to design high-performance and energy-efficient computing hardware for training CNNs. In this paper, we consider the problem of designing specialized CPU-GPU-based heterogeneous manycore systems for energy-efficient training of CNNs. It has already been shown that the typical on-chip communication infrastructures employed in conventional CPU-GPU-based heterogeneous manycore platforms are unable to handle both CPU and GPU communication requirements efficiently. To address this issue, we first analyze the on-chip traffic patterns that arise from the computational processes associated with training two deep CNN architectures, namely LeNet and CDBNet, to perform image classification. Leveraging this knowledge, we design a hybrid Network-on-Chip (NoC) architecture, consisting of both wireline and wireless links, to improve the performance of CPU-GPU-based heterogeneous manycore platforms running these CNN training workloads. Compared to a highly optimized wireline mesh NoC, the proposed NoC achieves a 1.8x reduction in network latency and a 2.2x improvement in network throughput when training CNNs. For the considered CNN workloads, these network-level improvements translate into 25 percent savings in full-system energy-delay product (EDP). This demonstrates that the proposed hybrid NoC for heterogeneous manycore architectures can significantly accelerate CNN training while remaining energy-efficient.
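For context on the headline result, the short Python sketch below works through the energy-delay product (EDP) metric that the abstract reports. It is an illustrative sketch only, not code or data from the paper: the edp helper and the baseline energy and delay values are hypothetical placeholders, while the 0.75x factor corresponds to the reported 25 percent full-system EDP saving.

# Illustrative sketch (not from the paper): how a reported full-system
# energy-delay product (EDP) saving relates to energy and delay.
# The 0.75x factor reflects the paper's reported 25 percent EDP saving;
# the baseline energy and delay values below are hypothetical.

def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds

# Hypothetical baseline for a wireline mesh NoC running a CNN training workload.
baseline = edp(energy_joules=50.0, delay_seconds=2.0)

# A 25 percent EDP saving means the hybrid NoC's EDP is 0.75x the baseline,
# regardless of how the saving splits between energy and delay.
hybrid = 0.75 * baseline
print(f"baseline EDP = {baseline:.1f} J*s, hybrid EDP = {hybrid:.1f} J*s")

EDP is a standard figure of merit in hardware design because it penalizes designs that save energy only by running slower, which is why the abstract reports it alongside the raw latency and throughput gains.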
Pages: 672-686
Page count: 15