Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices

Cited by: 306
Authors
Gautschi, Michael [1]
Schiavone, Pasquale Davide [1]
Traber, Andreas [2, 3]
Loi, Igor [5, 6]
Pullini, Antonio [1]
Rossi, Davide [5, 6]
Flamand, Eric [1, 5]
Gurkaynak, Frank K. [1]
Benini, Luca [1, 4]
Affiliations
[1] Swiss Fed Inst Technol, Integrated Syst Lab, CH-8092 Zurich, Switzerland
[2] Swiss Fed Inst Technol, Integrated Syst Lab, Elect Engn & Informat Technol, CH-8702 Zollikon, Switzerland
[3] Adv Circuit Pursuit, CH-8702 Zollikon, Switzerland
[4] Univ Bologna, I-40126 Bologna, Italy
[5] GreenWaves Technol, F-38190 Villard Bonnot, France
[6] Univ Bologna, I-40136 Bologna, Italy
Funding
European Research Council;
Keywords
Instruction set architecture (ISA) extensions; Internet-of-Things; multicore; RISC-V; ultralow power (ULP); PROCESSOR ARCHITECTURE; POWER; CLUSTER;
DOI
10.1109/TVLSI.2017.2654506
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Endpoint devices for the Internet-of-Things not only need to work under an extremely tight power envelope of a few milliwatts, but also need to be flexible in their computing capabilities, from a few kOPS to GOPS. Near-threshold (NT) operation can achieve higher energy efficiency, and performance scalability can be gained through parallelism. In this paper, we describe the design of an open-source RISC-V processor core specifically designed for NT operation in tightly coupled multicore clusters. We introduce instruction extensions and microarchitectural optimizations to increase the computational density and to minimize the pressure on the shared-memory hierarchy. For typical data-intensive sensor processing workloads, the proposed core is, on average, 3.5x faster and 3.2x more energy efficient, thanks to a smart L0 buffer to reduce cache access contentions and support for compressed instructions. Single Instruction Multiple Data (SIMD) extensions, such as dot products, and a built-in L0 storage further reduce the shared-memory accesses by 8x, reducing contentions by 3.2x. With four NT-optimized cores, the cluster is operational from 0.6 to 1.2 V, achieving a peak efficiency of 67 MOPS/mW in a low-cost 65-nm bulk CMOS technology. In a low-power 28-nm FD-SOI process, a peak efficiency of 193 MOPS/mW (40 MHz and 1 mW) can be achieved.
Pages: 2700-2713
Number of pages: 14