A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering

Cited by: 4
Authors
Valentin Jamet, Alexandre [1]
Vavouliotis, Georgios [2 ]
Jimenez, Daniel A. [3 ]
Alvarez, Lluc [1 ]
Casas, Marc [1 ]
Affiliations
[1] Universitat Politècnica de Catalunya (UPC), Barcelona Supercomputing Center (BSC), Barcelona, Spain
[2] Huawei Zurich Research Center, Zurich, Switzerland
[3] Texas A&M University, College Station, TX 77843, USA
Source
2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA 2024), 2024
Funding
U.S. National Science Foundation
DOI
10.1109/HPCA57654.2024.00046
Chinese Library Classification
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will be off-chip with adaptive prefetch filtering at the first-level data cache (L1D). TLP is composed of two connected microarchitectural perceptron predictors, named First Level Predictor (FLP) and Second Level Predictor (SLP). FLP performs accurate off-chip prediction by using several program features based on virtual addresses and a novel selective delay component. The novelty of SLP lies in leveraging off-chip prediction to drive L1D prefetch filtering, using physical addresses and the FLP prediction as features. TLP constitutes the first hardware proposal targeting both off-chip prediction and prefetch filtering with a multi-level perceptron approach, and it requires only 7KB of storage. To demonstrate the benefits of TLP, we compare its performance with state-of-the-art approaches for off-chip prediction and prefetch filtering on a wide range of single-core and multi-core workloads. Our experiments show that TLP reduces average DRAM transactions by 30.7% and 17.7% across the single-core and multi-core workloads, respectively, compared to a baseline that uses state-of-the-art cache prefetchers but no off-chip prediction mechanism, whereas recent related work significantly increases DRAM transactions. As a result, TLP achieves geometric mean speedups of 6.2% and 11.8% across single-core and multi-core workloads, respectively. In addition, our evaluation demonstrates that TLP is effective independently of the L1D prefetching logic.
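The following is a minimal software sketch of the two-level idea described in the abstract, written only to clarify how a hashed-perceptron predictor combines features and how FLP's output can feed SLP as a feature. Everything below is an illustrative assumption: the feature choices, table sizes, thresholds, training rules, and the names HashedPerceptron and should_issue_l1d_prefetch are hypothetical and not taken from the paper.

# Sketch of a two-level hashed-perceptron predictor in the spirit of TLP.
# All parameters and features below are illustrative assumptions, not the
# paper's actual configuration.

class HashedPerceptron:
    def __init__(self, num_features, table_size=256, threshold=0, train_margin=16):
        # One small weight table per feature, as in hashed-perceptron designs.
        self.tables = [[0] * table_size for _ in range(num_features)]
        self.table_size = table_size
        self.threshold = threshold        # decision boundary on the weight sum
        self.train_margin = train_margin  # keep training while |sum| is small

    def _indices(self, features):
        # Hash each feature into its own weight table.
        return [hash(f) % self.table_size for f in features]

    def predict(self, features):
        s = sum(t[i] for t, i in zip(self.tables, self._indices(features)))
        return s >= self.threshold, s

    def train(self, features, outcome):
        pred, s = self.predict(features)
        # Update on a misprediction or when confidence is below the margin.
        if pred != outcome or abs(s) < self.train_margin:
            delta = 1 if outcome else -1
            for t, i in zip(self.tables, self._indices(features)):
                t[i] = max(-32, min(31, t[i] + delta))  # saturating 6-bit weights

# FLP: off-chip prediction from virtual-address-based features.
flp = HashedPerceptron(num_features=3)
# SLP: L1D prefetch filtering from physical-address features plus FLP's output.
slp = HashedPerceptron(num_features=3)

def should_issue_l1d_prefetch(vaddr, paddr, pc):
    # Hypothetical FLP features: load PC, virtual page number, line-in-page.
    off_chip, _ = flp.predict((pc, vaddr >> 12, (vaddr >> 6) & 0x3F))
    # SLP consumes physical-address features and the FLP prediction itself.
    keep, _ = slp.predict((paddr >> 6, pc, off_chip))
    return keep

A real implementation would train FLP with the observed cache-hierarchy outcome of each load and SLP with the observed usefulness of each prefetch; those feedback paths are omitted here for brevity.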
Pages: 528-542
Page count: 15