Low Bit-Width Convolutional Neural Network on RRAM

Cited by: 43
Authors
Cai, Yi [1 ]
Tang, Tianqi [2 ]
Xia, Lixue [3 ]
Li, Boxun [1 ]
Wang, Yu [1 ]
Yang, Huazhong [1 ]
Affiliations
[1] Tsinghua Univ, Beijing Innovat Ctr Future Chips, Beijing Natl Res Ctr Informat Sci & Technol, Dept Elect Engn, Beijing 100084, Peoples R China
[2] Univ Calif Santa Barbara, Dept Elect & Comp Engn, Santa Barbara, CA 93106 USA
[3] Alibaba Grp Beijing, Dept Cloud Intelligence, Beijing 100022, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Pipelines; Training; Neural networks; Resistance; Convolution; Performance evaluation; Neurons; Constrained training; low bit-width convolutional neural network (LB-CNN); parameter splitting; pipeline; resistive random-access memory (RRAM);
DOI
10.1109/TCAD.2019.2917852
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline code
0812;
Abstract
The emerging resistive random-access memory (RRAM) has been widely applied to accelerate the computing of deep neural networks. However, it is challenging to achieve high-precision computation on RRAM due to the limits of the resistance levels and the interfaces. Low bit-width convolutional neural networks (CNNs) provide a promising way to introduce low bit-width RRAM devices and low bit-width interfaces into the RRAM-based computing system (RCS). However, open questions remain regarding: 1) how to split a weight matrix when a single crossbar is not large enough to hold all of its parameters; 2) how to design a pipeline that accelerates inference based on a line-buffer structure; and 3) how to reduce the accuracy drop caused by parameter splitting and data quantization. In this paper, we propose an RRAM crossbar-based low bit-width CNN (LB-CNN) accelerator. We discuss the system design in detail, including the matrix splitting strategies that enhance scalability and the pipelined implementation based on line buffers that accelerates inference. In addition, we propose a splitting-and-quantizing-while-training method to incorporate the actual hardware constraints into training. In our experiments, the low bit-width LeNet-5 on RRAM shows much better robustness to device variation than multibit models. The pipeline strategy achieves approximately 6.0x speedup per image on ResNet-18. For the low-bit VGG-8 on CIFAR-10, the proposed accelerator saves 54.9% of the energy consumption and 48.3% of the area compared with the multibit VGG-8 structure.
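As a rough illustration of the splitting-and-quantizing-while-training idea summarized in the abstract, the sketch below splits a weight matrix into crossbar-sized tiles and uniformly quantizes each tile to a low bit-width in the forward pass, while a full-precision copy would still receive the gradient updates (straight-through style). The crossbar size, bit-width, and all function names are assumptions made for illustration, not the authors' implementation.

# Hypothetical sketch of constrained training for an RRAM crossbar:
# split the weight matrix into crossbar-sized tiles and quantize each tile
# to a low bit-width in the forward pass. The crossbar size and bit-width
# below are illustrative assumptions, not values from the paper.
import numpy as np

CROSSBAR_SIZE = 128   # assumed maximum rows/columns one crossbar can hold
WEIGHT_BITS = 2       # assumed bit-width of the stored conductances

def split_matrix(weight, tile=CROSSBAR_SIZE):
    """Split a 2-D weight matrix into tiles that each fit a single crossbar."""
    rows, cols = weight.shape
    return [((r, c), weight[r:r + tile, c:c + tile])
            for r in range(0, rows, tile)
            for c in range(0, cols, tile)]

def quantize(weight, bits=WEIGHT_BITS):
    """Uniformly quantize weights to 2**bits levels spanning [-max, +max]."""
    levels = 2 ** bits - 1                       # number of quantization steps
    scale = np.max(np.abs(weight)) + 1e-12
    normalized = (weight / scale + 1.0) / 2.0    # map to [0, 1]
    snapped = np.round(normalized * levels) / levels
    return (snapped * 2.0 - 1.0) * scale         # map back to [-max, +max]

def forward_with_constraints(weight, x):
    """Matrix product as the split, quantized crossbars would compute it.

    In training, the gradient would bypass the quantizer and update the
    full-precision `weight` (straight-through estimator), so the network is
    trained under the same splitting and quantization it will see on hardware.
    """
    out = np.zeros((weight.shape[0], x.shape[1]))
    for (r, c), tile in split_matrix(weight):
        q_tile = quantize(tile)                              # low bit-width tile
        out[r:r + tile.shape[0]] += q_tile @ x[c:c + tile.shape[1]]
    return out

A training loop would substitute forward_with_constraints for the dense multiply and keep updating the full-precision weights, so the quantization and splitting constraints are folded into training rather than applied only after it.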
Pages: 1414-1427
Number of pages: 14