A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering

Cited by: 4
Authors
Valentin Jamet, Alexandre [1]
Vavouliotis, Georgios [2 ]
Jimenez, Daniel A. [3 ]
Alvarez, Lluc [1 ]
Casas, Marc [1 ]
Affiliations
[1] Universitat Politècnica de Catalunya (UPC), Barcelona Supercomputing Center (BSC), Barcelona, Spain
[2] Huawei Zurich Research Center, Zurich, Switzerland
[3] Texas A&M University, College Station, TX 77843, USA
Source
2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA 2024), 2024
Funding
U.S. National Science Foundation
DOI
10.1109/HPCA57654.2024.00046
Chinese Library Classification
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will be off-chip with adaptive prefetch filtering at the first-level data cache (L1D). TLP is composed of two connected microarchitectural perceptron predictors, named First Level Predictor (FLP) and Second Level Predictor (SLP). FLP performs accurate off-chip prediction by using several program features based on virtual addresses and a novel selective delay component. The novelty of SLP lies in leveraging off-chip prediction to drive L1D prefetch filtering, using physical addresses and the FLP prediction as features. TLP constitutes the first hardware proposal targeting both off-chip prediction and prefetch filtering with a multi-level perceptron approach, and it requires only 7KB of storage. To demonstrate the benefits of TLP, we compare its performance with state-of-the-art approaches for off-chip prediction and prefetch filtering on a wide range of single-core and multi-core workloads. Our experiments show that TLP reduces average DRAM transactions by 30.7% and 17.7% across the single-core and multi-core workloads, respectively, compared to a baseline that uses state-of-the-art cache prefetchers but no off-chip prediction mechanism, whereas recent related work significantly increases DRAM transactions. As a result, TLP achieves geometric mean speedups of 6.2% and 11.8% across single-core and multi-core workloads, respectively. In addition, our evaluation demonstrates that TLP is effective independently of the L1D prefetching logic.
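The following is a minimal software sketch of the two-level idea described in the abstract, written only to clarify how a hashed-perceptron predictor combines features and how FLP's output can feed SLP as a feature. Everything below is an illustrative assumption: the feature choices, table sizes, thresholds, training rules, and the names HashedPerceptron and should_issue_l1d_prefetch are hypothetical and not taken from the paper.

# Sketch of a two-level hashed-perceptron predictor in the spirit of TLP.
# All parameters and features below are illustrative assumptions, not the
# paper's actual configuration.

class HashedPerceptron:
    def __init__(self, num_features, table_size=256, threshold=0, train_margin=16):
        # One small weight table per feature, as in hashed-perceptron designs.
        self.tables = [[0] * table_size for _ in range(num_features)]
        self.table_size = table_size
        self.threshold = threshold        # decision boundary on the weight sum
        self.train_margin = train_margin  # keep training while |sum| is small

    def _indices(self, features):
        # Hash each feature into its own weight table.
        return [hash(f) % self.table_size for f in features]

    def predict(self, features):
        s = sum(t[i] for t, i in zip(self.tables, self._indices(features)))
        return s >= self.threshold, s

    def train(self, features, outcome):
        pred, s = self.predict(features)
        # Update on a misprediction or when confidence is below the margin.
        if pred != outcome or abs(s) < self.train_margin:
            delta = 1 if outcome else -1
            for t, i in zip(self.tables, self._indices(features)):
                t[i] = max(-32, min(31, t[i] + delta))  # saturating 6-bit weights

# FLP: off-chip prediction from virtual-address-based features.
flp = HashedPerceptron(num_features=3)
# SLP: L1D prefetch filtering from physical-address features plus FLP's output.
slp = HashedPerceptron(num_features=3)

def should_issue_l1d_prefetch(vaddr, paddr, pc):
    # Hypothetical FLP features: load PC, virtual page number, line-in-page.
    off_chip, _ = flp.predict((pc, vaddr >> 12, (vaddr >> 6) & 0x3F))
    # SLP consumes physical-address features and the FLP prediction itself.
    keep, _ = slp.predict((paddr >> 6, pc, off_chip))
    return keep

A real implementation would train FLP with the observed cache-hierarchy outcome of each load and SLP with the observed usefulness of each prefetch; those feedback paths are omitted here for brevity.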
Pages: 528-542
Page count: 15