DARKSIDE: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training

Cited by: 0
Authors
Garofalo, Angelo [1 ]
Tortorella, Yvan [1 ]
Perotti, Matteo [2 ]
Valente, Luca [1 ]
Nadalini, Alessandro [1 ]
Benini, Luca [1 ,2 ]
Rossi, Davide [1 ]
Conti, Francesco [1 ]
Affiliations
[1] Univ Bologna, Dept Elect Elect & Informat Engn, I-40126 Bologna, Italy
[2] Swiss Fed Inst Technol, IIS Integrated Syst Lab, CH-8092 Zurich, Switzerland
Source
IEEE OPEN JOURNAL OF THE SOLID-STATE CIRCUITS SOCIETY | 2022, Vol. 2
Funding
EU Horizon 2020;
Keywords
Training; Kernel; Engines; System-on-chip; Human computer interaction; Hardware; Arithmetic; Heterogeneous cluster; tensor product engine (TPE); ultralow-power AI; NEURAL-NETWORKS;
DOI
Not available
CLC Classification
TM [electrical engineering]; TN [electronic and communication technology];
Discipline Code
0808; 0809;
Abstract
On-chip deep neural network (DNN) inference and training at the Extreme-Edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are promising solutions to meet the challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present DARKSIDE, a System-on-Chip with a heterogeneous cluster of eight RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost the performance and efficiency on key compute-intensive DNN kernels, the cluster is enriched with three digital accelerators: 1) a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); 2) a minimal-overhead datamover to marshal 1- to 32-b data on-the-fly; and 3) a 16-b floating-point tensor product engine (TPE) for tiled matrix-multiplication acceleration. DARKSIDE is implemented in 65-nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency, enough to enable on-chip floating-point training at competitive speed coupled with ultralow-power quantized inference.
Pages: 231-243 (13 pages)
Related References
36 items
  • [1] [Anonymous], 2021, TRAINING MIXED PRECI
  • [2] Banbury, Colby, 2021, arXiv
  • [3] Bol, David; Schramme, Maxime; Moreau, Ludovic; Xu, Pengcheng; Dekimpe, Remi; Saeidi, Roghayeh; Haine, Thomas; Frenkel, Charlotte; Flandre, Denis. SleepRunner: A 28-nm FDSOI ULP Cortex-M0 MCU With ULL SRAM and UFBR PVT Compensation for 2.6-3.6-μW/DMIPS 40-80-MHz Active Mode and 131-nW/kB Fully Retentive Deep-Sleep Mode. IEEE JOURNAL OF SOLID-STATE CIRCUITS, 2021, 56(07): 2256-2269
  • [4] Bruschi, Nazareno; Haugou, Germain; Tagliavini, Giuseppe; Conti, Francesco; Benini, Luca; Rossi, Davide. GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors. 2021 IEEE 39TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD 2021), 2021: 409-416
  • [5] Burrello, Alessio; Garofalo, Angelo; Bruschi, Nazareno; Tagliavini, Giuseppe; Rossi, Davide; Conti, Francesco. DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs. IEEE TRANSACTIONS ON COMPUTERS, 2021, 70(08): 1253-1268
  • [6] Cai H, 2019, Arxiv, DOI arXiv:1812.00332
  • [7] Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel S.; Sze, Vivienne. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices. IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2019, 9(02): 292-308
  • [8] Desoli G, 2017, ISSCC DIG TECH PAP I, P238, DOI 10.1109/ISSCC.2017.7870349
  • [9] Garofalo A., 2022, arXiv
  • [10] Garofalo, Angelo; Tagliavini, Giuseppe; Conti, Francesco; Benini, Luca; Rossi, Davide. XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Networks on RISC-V Based IoT End Nodes. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, 2021, 9(03): 1489-1505