A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing

Cited by: 129
Authors
Jia, Hongyang [1 ]
Valavi, Hossein [1 ]
Tang, Yinqi [1 ]
Zhang, Jintao [2 ]
Verma, Naveen [1 ]
Affiliations
[1] Princeton Univ, Dept Elect Engn, Princeton, NJ 08544 USA
[2] IBM TJ Watson Ctr, Ossining, NY 10562 USA
Keywords
Charge-domain compute; deep learning; hardware accelerators; in-memory computing (IMC); neural networks (NNs); CHIP; SRAM
DOI
10.1109/JSSC.2020.2987714
Chinese Library Classification: TM (Electrical Technology); TN (Electronic Technology, Communication Technology)
Discipline Codes: 0808; 0809
Abstract
In-memory computing (IMC) addresses the cost of accessing data from memory in a manner that introduces a tradeoff between energy/throughput and computation signal-to-noise ratio (SNR). However, low SNR has posed a primary restriction to integrating IMC into the larger, heterogeneous architectures required for practical workloads, due to the challenges of creating the robust abstractions needed across the hardware and software stack. This work exploits recent progress in high-SNR IMC to achieve a programmable heterogeneous microprocessor architecture, implemented in 65-nm CMOS, together with corresponding software interfaces that enable mapping of application workloads. The architecture consists of a 590-Kb IMC accelerator, a configurable digital near-memory-computing (NMC) accelerator, a RISC-V CPU, and other peripherals. To enable programmability, the microarchitectural design of the IMC accelerator provides integration in the standard processor memory space, area- and energy-efficient analog-to-digital conversion for interfacing to NMC, bit-scalable computation (1-8 b), and input-vector-sparsity-proportional energy consumption. The IMC accelerator demonstrates excellent matching between computed outputs and idealized software-modeled outputs, at a 1-b TOPS/W of 192 | 400 and a 1-b TOPS/mm² of 0.60 | 0.24 for the MAC hardware, at VDD of 1.2 | 0.85 V, both of which scale directly with the bit precision of the input-vector and matrix elements. Software libraries developed for application mapping are used to demonstrate CIFAR-10 image classification with a ten-layer CNN, achieving accuracy, throughput, and energy of 89.3% | 92.4%, 176 | 23 images/s, and 5.31 | 105.2 μJ/image, for 1 | 4-b quantization levels.
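The "bit-scalable computation (1-8 b)" claimed in the abstract rests on a standard arithmetic identity: a multi-bit matrix-vector multiply can be decomposed into binary (1-b) matrix-vector MACs over the operands' bit-planes, with each partial result weighted by a power of two. The sketch below illustrates that decomposition numerically for unsigned operands; it is an illustrative model of the arithmetic, not the chip's actual analog implementation, and the function name and unsigned encoding are assumptions for the example.

```python
import numpy as np

def bit_serial_matvec(W, x, wb, xb):
    """Compute W @ x from binary bit-planes: since
    W = sum_i W_i * 2^i and x = sum_j x_j * 2^j,
    W @ x = sum_{i,j} (W_i @ x_j) * 2^(i+j),
    where each W_i @ x_j is a 1-b matrix-vector MAC
    (the operation an IMC array performs natively)."""
    acc = np.zeros(W.shape[0], dtype=np.int64)
    for i in range(wb):                 # weight bit-planes
        W_i = (W >> i) & 1
        for j in range(xb):             # input-vector bit-planes
            x_j = (x >> j) & 1
            acc += (W_i @ x_j) << (i + j)
    return acc

# Demo with unsigned 4-b weights and 4-b inputs.
rng = np.random.default_rng(0)
W = rng.integers(0, 16, size=(3, 8))
x = rng.integers(0, 16, size=8)
assert np.array_equal(bit_serial_matvec(W, x, 4, 4), W @ x)
```

Energy in such a scheme scales with the number of bit-plane passes (wb × xb), which is consistent with the abstract's observation that efficiency scales directly with the bit precision of the input-vector and matrix elements.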
Pages: 2609-2621 (13 pages)
References (35 total)
  • [1] Amodei, D., et al., 2016, Proceedings of Machine Learning Research, vol. 48.
  • [2] Ando, K., Ueyoshi, K., Orimo, K., Yonekawa, H., Sato, S., Nakahara, H., Takamaeda-Yamazaki, S., Ikebe, M., Asai, T., Kuroda, T., and Motomura, M., "BRein Memory: A Single-Chip Binary/Ternary Reconfigurable in-Memory Deep Neural Network Accelerator Achieving 1.4 TOPS at 0.6 W," IEEE Journal of Solid-State Circuits, vol. 53, no. 4, pp. 983-994, 2018.
  • [3] [Anonymous], 1997, Neural Computation.
  • [4] Bankman, D., Yang, L., Moons, B., Verhelst, M., and Murmann, B., "An Always-On 3.8 μJ/86% CIFAR-10 Mixed-Signal Binary CNN Processor With All Memory on Chip in 28-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 158-172, 2019.
  • [5] Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.
  • [6] Choi, J., 2018, Proc. Int. Conf. Learning Representations.
  • [7] Dai, X., Yin, H., and Jha, N. K., "Grow and Prune Compact, Fast, and Accurate LSTMs," IEEE Transactions on Computers, vol. 69, no. 3, pp. 441-452, 2020.
  • [8] Devlin, J., et al., 2019, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), vol. 1, p. 4171.
  • [9] Gonugondla, S. K., et al., 2018, ISSCC Digest of Technical Papers, p. 490, DOI: 10.1109/ISSCC.2018.8310398.
  • [10] Guo, R. Q., et al., 2019, Symposium on VLSI Circuits, p. C120, DOI: 10.23919/VLSIC.2019.8778028.