Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on Processing-in-Memory Architectures

Cited by: 73
Authors
Peng, Xiaochen [1 ]
Liu, Rui [2 ,3 ]
Yu, Shimeng [1 ]
Affiliations
[1] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
[2] Arizona State Univ, Sch Elect Comp & Energy Engn, Tempe, AZ 85281 USA
[3] Design Grp Synopsys, Mountain View, CA 94043 USA
Keywords
Convolutional neural network; processing in memory; hardware accelerator; resistive random access memory
DOI
10.1109/TCSI.2019.2958568
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic and Communication Technology]
Discipline Codes
0808; 0809
Abstract
Recent state-of-the-art deep convolutional neural networks (CNNs) have shown remarkable success in current intelligent systems for various tasks, such as image/speech recognition and classification. A number of recent efforts have attempted to design custom inference engines based on processing-in-memory (PIM) architecture, where the memory array is used for weighted-sum computation, thereby avoiding frequent data transfer between buffers and computation units. Prior PIM designs typically unroll each 3D kernel of the convolutional layers into a vertical column of a large weight matrix, where the input data needs to be accessed multiple times. In this paper, in order to maximize both weight and input data reuse in a PIM architecture, we propose a novel weight mapping method and a corresponding data flow that divides the kernels and assigns the input data to different processing elements (PEs) according to their spatial locations. As a case study, a resistive random access memory (RRAM)-based 8-bit PIM design at 32 nm is benchmarked. The proposed mapping method and data flow yield a ~2.03x speedup and a ~1.4x improvement in throughput and energy efficiency for ResNet-34, compared with a prior design based on the conventional mapping method. To further optimize hardware performance and throughput, we propose an optimal pipeline architecture; with ~50% area overhead, it achieves an overall 913x and 1.96x improvement in throughput and energy efficiency, reaching 132476 FPS and 20.1 TOPS/W, respectively.
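The "conventional mapping" the abstract contrasts against can be sketched in plain NumPy: each 3D kernel is flattened into one column of a weight matrix, and the convolution becomes a single matrix multiply over im2col-expanded inputs, whose overlapping windows duplicate input pixels (the repeated input access the proposed mapping aims to avoid). This is an illustrative sketch, not the paper's code; function names and shapes are assumptions.

```python
import numpy as np

def conv_as_matmul(x, w, stride=1):
    """x: input (C, H, W); w: kernels (M, C, K, K). Returns (M, Ho, Wo)."""
    C, H, W = x.shape
    M, _, K, _ = w.shape
    Ho = (H - K) // stride + 1
    Wo = (W - K) // stride + 1
    # Conventional mapping: each 3D kernel unrolled into one column -> (C*K*K, M)
    wm = w.reshape(M, C * K * K).T
    # im2col: each sliding window flattened into a row -> (Ho*Wo, C*K*K).
    # Overlapping windows copy the same input pixels many times: this is
    # the repeated input access highlighted in the abstract.
    cols = np.empty((Ho * Wo, C * K * K))
    for i in range(Ho):
        for j in range(Wo):
            patch = x[:, i*stride:i*stride+K, j*stride:j*stride+K]
            cols[i * Wo + j] = patch.ravel()
    out = cols @ wm  # weighted sums, as computed inside the memory array
    return out.T.reshape(M, Ho, Wo)

x = np.random.rand(3, 8, 8)
w = np.random.rand(4, 3, 3, 3)
y = conv_as_matmul(x, w)
print(y.shape)  # (4, 6, 6)
```

In a crossbar-based PIM engine, `wm` corresponds to the conductances programmed into the memory array, and each row of `cols` is a vector of input voltages applied across it; the proposed mapping instead partitions kernels and inputs across PEs by spatial location so each input pixel is fetched fewer times.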
Pages: 1333-1343
Page count: 11