Exploiting Page Table Locality for Agile TLB Prefetching

Cited by: 20
Authors
Vavouliotis, Georgios [1, 2]
Alvarez, Lluc [1, 2]
Karakostas, Vasileios [3]
Nikas, Konstantinos [3]
Koziris, Nectarios [3]
Jimenez, Daniel A. [4]
Casas, Marc [1, 2]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] Univ Politecn Cataluna, Barcelona, Spain
[3] Natl Tech Univ Athens, Athens, Greece
[4] Texas A&M Univ, College Stn, TX 77843 USA
Source
2021 ACM/IEEE 48TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2021) | 2021
DOI
10.1109/ISCA52012.2021.00016
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Frequent Translation Lookaside Buffer (TLB) misses incur high performance and energy costs due to page walks required for fetching the corresponding address translations. Prefetching page table entries (PTEs) ahead of demand TLB accesses can mitigate the address translation performance bottleneck, but each prefetch requires traversing the page table, triggering additional accesses to the memory hierarchy. Therefore, TLB prefetching is a costly technique that may undermine performance when the prefetches are not accurate. In this paper we exploit the locality in the last level of the page table to reduce the cost and enhance the effectiveness of TLB prefetching by fetching cache-line adjacent PTEs "for free". We propose Sampling-Based Free TLB Prefetching (SBFP), a dynamic scheme that predicts the usefulness of these "free" PTEs and prefetches only the ones most likely to prevent TLB misses. We demonstrate that combining SBFP with novel and state-of-the-art TLB prefetchers significantly improves miss coverage and reduces most memory accesses due to page walks. Moreover, we propose Agile TLB Prefetcher (ATP), a novel composite TLB prefetcher particularly designed to maximize the benefits of SBFP. ATP efficiently combines three low-cost TLB prefetchers and disables TLB prefetching for those execution phases that do not benefit from it. Unlike state-of-the-art TLB prefetchers that correlate patterns with only one feature (e.g., strides, PC, distances), ATP correlates patterns with multiple features and dynamically enables the most appropriate TLB prefetcher per TLB miss. To alleviate the address translation performance bottleneck, we propose a unified solution that combines ATP and SBFP. Across an extensive set of industrial workloads provided by Qualcomm, ATP coupled with SBFP improves geometric speedup by 16.2%, and eliminates on average 37% of the memory references due to page walks. Considering the SPEC CPU 2006 and SPEC CPU 2017 benchmark suites, ATP with SBFP increases geometric speedup by 11.1%, and eliminates page walk memory references by 26%. Applied to big data workloads (GAP suite, XSBench), ATP with SBFP yields a geometric speedup of 11.8% while reducing page walk memory references by 5%. Over the best state-of-the-art TLB prefetcher for each benchmark suite, ATP with SBFP achieves speedups of 8.7%, 3.4%, and 4.2% for the Qualcomm, SPEC, and GAP+XSBench workloads, respectively.
Pages: 85-98
Number of pages: 14