DARKSIDE: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training

Cited by: 0
Authors
Garofalo, Angelo [1 ]
Tortorella, Yvan [1 ]
Perotti, Matteo [2 ]
Valente, Luca [1 ]
Nadalini, Alessandro [1 ]
Benini, Luca [1 ,2 ]
Rossi, Davide [1 ]
Conti, Francesco [1 ]
Affiliations
[1] Univ Bologna, Dept Elect Elect & Informat Engn, I-40126 Bologna, Italy
[2] Swiss Fed Inst Technol, IIS Integrated Syst Lab, CH-8092 Zurich, Switzerland
Source
IEEE OPEN JOURNAL OF THE SOLID-STATE CIRCUITS SOCIETY | 2022, Vol. 2
Funding
EU Horizon 2020;
Keywords
Training; Kernel; Engines; System-on-chip; Human computer interaction; Hardware; Arithmetic; Heterogeneous cluster; tensor product engine (TPE); ultralow-power AI; NEURAL-NETWORKS;
DOI
Not available
CLC Classification
TM [electrical engineering]; TN [electronic and communication technology];
Discipline Code
0808; 0809;
Abstract
On-chip deep neural network (DNN) inference and training at the Extreme-Edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are promising solutions to meet the challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present DARKSIDE, a System-on-Chip with a heterogeneous cluster of eight RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost the performance and efficiency on key compute-intensive DNN kernels, the cluster is enriched with three digital accelerators: 1) a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); 2) a minimal-overhead datamover to marshal 1- to 32-b data on-the-fly; and 3) a 16-b floating-point tensor product engine (TPE) for tiled matrix-multiplication acceleration. DARKSIDE is implemented in 65-nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency, enough to enable on-chip floating-point training at competitive speed coupled with ultralow-power quantized inference.
Pages: 231-243 (13 pages)
Related References
36 items
  • [1] [Anonymous], 2021, TRAINING MIXED PRECI
  • [2] Banbury, Colby, 2021, arXiv
  • [3] Bol, David; Schramme, Maxime; Moreau, Ludovic; Xu, Pengcheng; Dekimpe, Remi; Saeidi, Roghayeh; Haine, Thomas; Frenkel, Charlotte; Flandre, Denis. SleepRunner: A 28-nm FDSOI ULP Cortex-M0 MCU With ULL SRAM and UFBR PVT Compensation for 2.6-3.6-μW/DMIPS 40-80-MHz Active Mode and 131-nW/kB Fully Retentive Deep-Sleep Mode. IEEE JOURNAL OF SOLID-STATE CIRCUITS, 2021, 56(07): 2256-2269
  • [4] Bruschi, Nazareno; Haugou, Germain; Tagliavini, Giuseppe; Conti, Francesco; Benini, Luca; Rossi, Davide. GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors. 2021 IEEE 39TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD 2021), 2021: 409-416
  • [5] Burrello, Alessio; Garofalo, Angelo; Bruschi, Nazareno; Tagliavini, Giuseppe; Rossi, Davide; Conti, Francesco. DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs. IEEE TRANSACTIONS ON COMPUTERS, 2021, 70(08): 1253-1268
  • [6] Cai H, 2019, Arxiv, DOI arXiv:1812.00332
  • [7] Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel S.; Sze, Vivienne. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices. IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2019, 9(02): 292-308
  • [8] Desoli G, 2017, ISSCC DIG TECH PAP I, P238, DOI 10.1109/ISSCC.2017.7870349
  • [9] Garofalo A., 2022, arXiv
  • [10] Garofalo, Angelo; Tagliavini, Giuseppe; Conti, Francesco; Benini, Luca; Rossi, Davide. XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Networks on RISC-V Based IoT End Nodes. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, 2021, 9(03): 1489-1505