FPGA demonstrator of a Programmable Ultra-efficient Memristor-based Machine Learning Inference Accelerator

Cited by: 1
Authors
Foltin, Martin [1 ]
Warner, Craig [2 ]
Lee, Eddie [2 ]
Chalamalasetti, Sai Rahul [3 ]
Brueggen, Chris [2 ]
Williams, Charles [2 ]
Jansen, Nathaniel [4 ]
Saenz, Felipe [5 ]
Li, Luis Federico [5 ]
Aguiar, Glaucimar [6 ]
Antunes, Rodrigo [7 ]
Silveira, Plinio [7 ]
Knuppe, Gustavo [7 ]
Ambrosi, Joao [7 ]
Chatterjee, Soumitra [8 ]
Kolhe, Jitendra Onkar [8 ]
Lakshiminarashimha, Sunil [9 ]
Milojicic, Dejan [3 ]
Strachan, John Paul [3 ]
Sharma, Amit [3 ]
Affiliations
[1] Hewlett Packard Enterprise, Silicon Design Lab, Ft Collins, CO 80528 USA
[2] Hewlett Packard Enterprise, Silicon Design Lab, Plano, TX USA
[3] Hewlett Packard Enterprise, Silicon Design Lab, Palo Alto, CA USA
[4] Hewlett Packard Enterprise, Silicon Design Lab, Houston, TX USA
[5] Hewlett Packard Enterprise, Silicon Design Lab, Heredia, Costa Rica
[6] Hewlett Packard Enterprise, Brazil Labs, Barueri, Brazil
[7] Hewlett Packard Enterprise, Brazil Labs, Porto Alegre, RS, Brazil
[8] Hewlett Packard Enterprise, SSTO RnD, Bangalore, Karnataka, India
[9] Hewlett Packard Enterprise, Composable Engn, Bangalore, Karnataka, India
Source
Proceedings of the 2019 Fourth IEEE International Conference on Rebooting Computing (ICRC) | 2019
Keywords
Deep Neural Network Inference; Neural Network Acceleration; Memristor
DOI
10.1109/icrc.2019.8914705
CLC Number
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Hybrid analog-digital neuromorphic accelerators show promise for a significant increase in performance per watt for deep learning inference and training compared with conventional technologies. In this work we present an FPGA demonstrator of a programmable hybrid inference accelerator, with memristor analog dot-product engines emulated by digital matrix-vector multiplication units that employ FPGA SRAM for in-situ weight storage. The full-chip demonstrator, interfaced to a host over PCIe, serves as a software development platform and a vehicle for further hardware microarchitecture improvements. Implementation of the compute cores, tiles, network on chip, and host interface is discussed. A new pipelining scheme is introduced that achieves high utilization of the matrix-vector multiplication units while reducing the tile data memory required for neural network layer activations. The data flow orchestration between tiles, controlled by a RISC-V core, is described. Inference accuracy analysis is presented for example RNN and CNN models. The demonstrator is instrumented with hardware monitors to enable performance measurement and tuning. Performance projections for a future memristor-based ASIC are also discussed.
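To make the emulation concept concrete, below is a minimal sketch (not the paper's implementation) of how a digital matrix-vector multiplication unit can stand in for an analog memristor dot-product engine: weights are quantized once and kept resident, with a NumPy array standing in for the FPGA SRAM, and each step computes a fixed-point MVM followed by a rescale. The crossbar dimensions, bit widths, and all names here are illustrative assumptions.

import numpy as np

XBAR_ROWS, XBAR_COLS = 128, 128   # assumed crossbar dimensions (illustrative)
W_BITS, X_BITS = 8, 8             # assumed weight/activation precision (illustrative)

def quantize(a, bits):
    """Uniform symmetric quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.max(np.abs(a)) / qmax) or 1.0  # guard all-zero input
    return np.round(a / scale).astype(np.int32), scale

class EmulatedDotProductEngine:
    """Digital stand-in for one analog memristor crossbar tile."""
    def __init__(self, weights):
        assert weights.shape == (XBAR_ROWS, XBAR_COLS)
        # "In-situ" weight storage: the quantized weights stay resident,
        # as SRAM-held weights do in the FPGA demonstrator.
        self.w_q, self.w_scale = quantize(weights, W_BITS)

    def mvm(self, x):
        """Fixed-point matrix-vector multiply, then rescale to real values."""
        x_q, x_scale = quantize(x, X_BITS)
        acc = self.w_q @ x_q              # integer multiply-accumulate
        return acc * (self.w_scale * x_scale)

# Usage: one tile computing a 128-wide slice of a layer.
rng = np.random.default_rng(0)
engine = EmulatedDotProductEngine(0.1 * rng.standard_normal((XBAR_ROWS, XBAR_COLS)))
y = engine.mvm(rng.standard_normal(XBAR_COLS))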
Pages: 44-53
Page count: 10