Berti: an Accurate Local-Delta Data Prefetcher

Times cited: 32
Authors
Navarro-Torres, Agustin [1]
Panda, Biswabandan [2]
Alastruey-Benede, Jesus [1]
Ibanez, Pablo [1]
Vinals-Yufera, Victor [1]
Ros, Alberto [3]
Affiliations
[1] Univ Zaragoza, Dept Informt & Ingn Sistemas I3A, Zaragoza, Spain
[2] Indian Inst Technol, Dept Comp Sci & Engn, Mumbai, Maharashtra, India
[3] Univ Murcia, Dept Comp Engn, Murcia, Spain
Source
2022 55th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) | 2022
Funding
European Research Council
Keywords
data prefetching; hardware prefetching; first-level cache; local deltas; accuracy; timeliness; performance; prediction
DOI
10.1109/MICRO56248.2022.00072
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Data prefetching is a technique that plays a crucial role in modern high-performance processors by hiding long latency memory accesses. Several state-of-the-art hardware prefetchers exploit the concept of deltas, defined as the difference between the cache line addresses of two demand accesses. Existing delta prefetchers, such as best offset prefetching (BOP) and multi-lookahead prefetching (MLOP), train and predict future accesses based on global deltas. We observed that the use of global deltas results in missed opportunities to anticipate memory accesses. In this paper, we propose Berti, a first-level data cache prefetcher that selects the best local deltas, i.e., those that consider only demand accesses issued by the same instruction. Thanks to a high-confidence mechanism that precisely detects the timely local deltas with high coverage, Berti generates accurate prefetch requests. Then, it orchestrates the prefetch requests to the memory hierarchy, using the selected deltas. Our empirical results using ChampSim and SPEC CPU2017 and GAP workloads show that, with a storage overhead of just 2.55 KB, Berti improves performance by 8.5% compared to a baseline IP-stride and 15% compared to IPCP, a state-of-the-art prefetcher. Our evaluation also shows that Berti reduces dynamic energy at the memory hierarchy by 33.6% compared to IPCP, thanks to its high prefetch accuracy.
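The difference between global and local deltas described in the abstract can be illustrated with a minimal sketch. It is not Berti's mechanism: it only computes deltas between consecutive demand accesses, and it ignores the timeliness, confidence, and coverage tracking the abstract mentions. The function names `global_deltas` and `local_deltas`, the `(ip, cache_line)` trace format, and the trace values are all assumptions made for illustration.

```python
from collections import defaultdict


def global_deltas(trace):
    """Deltas between consecutive demand accesses, ignoring which instruction issued them."""
    lines = [line for _, line in trace]
    return [b - a for a, b in zip(lines, lines[1:])]


def local_deltas(trace):
    """Deltas between consecutive demand accesses issued by the same instruction pointer (IP)."""
    last_line = {}              # IP -> cache line of its most recent access
    deltas = defaultdict(list)  # IP -> list of observed local deltas
    for ip, line in trace:
        if ip in last_line:
            deltas[ip].append(line - last_line[ip])
        last_line[ip] = line
    return dict(deltas)


# Hypothetical trace of (instruction pointer, cache line address) pairs:
# two interleaved streams, one ascending by 2 lines, one descending by 3 lines.
trace = [
    (0x400, 100), (0x500, 900),
    (0x400, 102), (0x500, 897),
    (0x400, 104), (0x500, 894),
]

print(global_deltas(trace))  # [800, -798, 795, -793, 790]
print(local_deltas(trace))   # {1024: [2, 2], 1280: [-3, -3]}
```

In this made-up trace the global view sees no repeating delta, while each instruction's own accesses show a constant one (+2 and -3). This is the kind of opportunity the abstract says global deltas miss.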
Pages: 975-991
Page count: 17