Scalable Hierarchical Instruction Cache for Ultralow-Power Processors Clusters

Cited by: 1
Authors
Chen, Jie [1 ,2 ]
Loi, Igor [2 ]
Flamand, Eric [2 ]
Tagliavini, Giuseppe [3 ]
Benini, Luca [1 ,4 ]
Rossi, Davide [1 ]
Affiliations
[1] Univ Bologna, Dept Elect Elect & Informat Engn Guglielmo Marconi, I-40126 Bologna, Italy
[2] GreenWaves Technol, F-38100 Grenoble, France
[3] Univ Bologna, Dept Comp Sci & Engn DISI, I-40126 Bologna, Italy
[4] Swiss Fed Inst Technol, Dept Informat Technol & Elect Engn, CH-8092 Zurich, Switzerland
Funding
European Union's Horizon 2020;
Keywords
Prefetching; Codes; Scalability; Internet of Things; Multicore processing; Switched mode power supplies; Standards; Energy efficiency; instruction cache; parallel; prefetch; ultralow-power (ULP);
DOI
10.1109/TVLSI.2022.3228336
Chinese Library Classification (CLC): TP3 [Computing technology, computer technology];
Subject classification code: 0812;
Abstract
High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture, due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultralow-power (ULP) tightly coupled processor clusters, where a relatively large cache (L1.5) is shared by L1 private (PR) caches through a two-cycle latency interconnect. To address the performance loss caused by L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures' performance and energy efficiency for parallel ULP (PULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. The proposed two-level cache improves maximum performance by up to 17% compared to the state of the art while, on average, delivering similar energy efficiency for most relevant applications.
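The next-line prefetcher with cache probe filtering (CPF) described in the abstract can be illustrated in behavioral form. The sketch below is an assumption-laden model, not the paper's RTL: the direct-mapped geometry, line count, trace, and all function names (`probe`, `fill`, `run`) are illustrative choices. The CPF step corresponds to the tag-only probe that suppresses a prefetch request when the next line is already resident in L1.

```python
# Hypothetical behavioral model (not the authors' design): a next-line
# instruction prefetcher with cache probe filtering (CPF) over a trivial
# direct-mapped L1 I-cache, driven by a trace of instruction line addresses.

class L1ICache:
    def __init__(self, n_lines=16):
        self.n_lines = n_lines
        self.tags = [None] * n_lines       # one tag per direct-mapped set

    def probe(self, line_addr):
        """Tag-only lookup (the CPF step): is this line already cached?"""
        return self.tags[line_addr % self.n_lines] == line_addr

    def fill(self, line_addr):
        self.tags[line_addr % self.n_lines] = line_addr

def run(trace, prefetch=True):
    l1 = L1ICache()
    misses = prefetches = 0
    for line in trace:
        if not l1.probe(line):
            misses += 1                    # demand miss -> fetch line from L1.5
            l1.fill(line)
        if prefetch and not l1.probe(line + 1):
            prefetches += 1                # CPF passed: issue next-line prefetch
            l1.fill(line + 1)
    return misses, prefetches

# Sequential instruction streams benefit most from next-line prefetching:
seq = list(range(0, 8)) * 4                # a loop over 8 consecutive lines
print(run(seq, prefetch=False))            # -> (8, 0): every new line is a demand miss
print(run(seq, prefetch=True))             # -> (1, 8): prefetch hides all but the first miss
```

In this toy model the prefetcher turns seven of the eight cold demand misses into prefetch hits, while CPF keeps it from re-requesting lines that are already resident on later loop iterations; the real design additionally weighs the two-cycle interconnect latency to the shared L1.5.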
Pages: 456-469
Number of pages: 14