RDMKE: Applying Reuse Distance Analysis to Multiple GPU Kernel Executions

Cited by: 1
Authors
Kiani, Mohsen [1 ]
Rajabzadeh, Amir [1 ]
Affiliations
[1] Razi Univ, Dept Comp Engn & Informat Technol, Kermanshah, Iran
Keywords
Performance modeling; cache memory; GPU; reuse distance analysis
DOI
10.1142/S0218126619502451
CLC Number
TP3 [Computing technology; computer technology]
Discipline Code
0812
Abstract
Modern GPUs can execute multiple kernels concurrently to keep hardware resources busy and boost overall performance. This approach is called simultaneous multiple kernel execution (MKE). MKE is a promising approach for improving GPU hardware utilization. Although modern GPUs allow MKE, the effects of different MKE scenarios have not been adequately studied. Since cache memories significantly affect overall GPU performance, the effects of MKE on cache performance should be investigated properly. The present study proposes a framework, called RDMKE (short for Reuse Distance-based profiling in MKEs), to provide a method for analyzing GPU cache memory performance in MKE scenarios. The raw memory access information of a kernel is first extracted, and RDMKE then imposes an ordering on the memory accesses so that the resulting trace represents a given MKE scenario. Afterward, RDMKE employs reuse distance analysis (RDA) to generate cache-related performance metrics, including hit ratios, transaction counts, and cache set and Miss Status Holding Register (MSHR) reservation failures. In addition, RDMKE provides the user with RD profiles as a useful locality metric. The simulation results of single kernel executions show a fair correlation between the results generated by RDMKE and GPU performance counters. Further, the simulation results of 28 two-kernel executions indicate that RDMKE can properly capture the nonlinear cache behaviors in MKE scenarios.
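The core technique named in the abstract, reuse distance analysis over an interleaved memory trace, can be sketched in a few lines. The sketch below is illustrative only: the function names are invented here, the round-robin interleaving is one hypothetical ordering of two kernels' accesses (the paper's actual MKE ordering policy is not reproduced), and the hit model is a fully associative LRU cache rather than the set-associative model with MSHRs that RDMKE uses.

```python
from collections import OrderedDict
from itertools import zip_longest

def reuse_distances(trace):
    """For each access, return the number of DISTINCT addresses touched
    since the previous access to the same address, or None on a cold
    (first) access. This is the classic LRU stack distance."""
    stack = OrderedDict()  # LRU stack: most recently used at the end
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            dists.append(len(keys) - 1 - keys.index(addr))
            del stack[addr]          # remove, then re-push to the top
        else:
            dists.append(None)       # infinite reuse distance
        stack[addr] = True
    return dists

def lru_hit_ratio(trace, capacity):
    """In a fully associative LRU cache holding `capacity` lines, an
    access hits iff its reuse distance is strictly less than capacity."""
    ds = reuse_distances(trace)
    hits = sum(1 for d in ds if d is not None and d < capacity)
    return hits / len(trace)

def interleave(trace_a, trace_b):
    """Round-robin interleaving of two kernels' traces: a simple,
    hypothetical stand-in for an MKE access ordering."""
    return [x for pair in zip_longest(trace_a, trace_b)
            for x in pair if x is not None]
```

A histogram of the values returned by `reuse_distances` is the RD profile the abstract refers to; comparing `lru_hit_ratio(trace_a, c)` against `lru_hit_ratio(interleave(trace_a, trace_b), c)` illustrates, in miniature, why co-running kernels can degrade each other's cache hit ratios nonlinearly.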
Pages: 26