GPUs Cache Performance Estimation using Reuse Distance Analysis

Cited by: 18
Authors
Arafa, Yehia [1 ]
Chennupati, Gopinath [2 ]
Barai, Atanu [1 ]
Badawy, Abdel-Hameed A. [1 ,2 ]
Santhi, Nandakishore [2 ]
Eidenbenz, Stephan [2 ]
Affiliations
[1] New Mexico State Univ, Klipsch Sch Elect & Comp Engn, Las Cruces, NM 88003 USA
[2] Los Alamos Natl Lab, Los Alamos, NM USA
Source
2019 IEEE 38TH INTERNATIONAL PERFORMANCE COMPUTING AND COMMUNICATIONS CONFERENCE (IPCCC) | 2019
Keywords
Reuse Distance; GPGPU; Caches; NVIDIA SASSI; LOCALITY;
DOI
10.1109/ipccc47392.2019.8958760
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
GPU architects introduced on-chip memories to provide local storage near the processing units and thereby reduce traffic to the device's global memory. Since then, modeling to predict cache performance has been an active area of research; however, the complexity of this highly parallel hardware makes it far from a straightforward task. In this paper, we propose a memory model that predicts the performance of the entire cache hierarchy (L1 & L2 caches) in GPUs. Our model is based on reuse distance: we use an analytical, probabilistic measure of the reuse distance distributions from an application's memory traces to predict the hit rates. The application's memory trace is extracted using NVIDIA's SASSI instrumentation tool. We evaluate the model on 20 different kernels from the Polybench and Rodinia benchmark suites and compare its predictions against real hardware. The results show an average prediction accuracy of 86.7% across all kernels, with higher accuracy for the L2 cache (95.26%) than for the L1. Furthermore, extracting an application's memory trace is on average only 4.9x slower than running the kernels without instrumentation, an overhead much smaller than in other published results. Finally, our model is flexible: it takes the different cache parameters into account, so it can be used for design space exploration and sensitivity analysis.
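The abstract's core idea, predicting cache hit rates from reuse distance distributions, can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual analytical probabilistic model over SASSI traces: for a fully associative LRU cache of C lines, an access hits exactly when its reuse distance (the number of distinct addresses touched since the previous access to the same address) is less than C.

```python
# Minimal sketch (assumed simplification): reuse-distance-based hit-rate
# prediction for a fully associative LRU cache. The paper's model instead
# works probabilistically on distributions and models both L1 and L2.
from collections import OrderedDict

def reuse_distances(trace):
    """Reuse distance of each access (inf for the first use of an address)."""
    stack = OrderedDict()  # insertion order, most recently used at the end
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack.keys())
            # distinct addresses accessed since the last touch of addr
            dists.append(len(keys) - 1 - keys.index(addr))
            del stack[addr]
        else:
            dists.append(float("inf"))
        stack[addr] = None  # move (or insert) addr to most-recent position
    return dists

def predicted_hit_rate(trace, cache_lines):
    """Under LRU, an access hits iff its reuse distance < cache_lines."""
    dists = reuse_distances(trace)
    return sum(1 for d in dists if d < cache_lines) / len(dists)
```

For example, the trace `a b a b` has reuse distances `[inf, inf, 1, 1]`, so a 2-line cache is predicted to hit on half the accesses. The model's advantage over simulation is that once the distance distribution is known, the hit rate for any cache size follows directly, which is what enables the design-space exploration the abstract mentions.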
Pages: 8