Using hardware performance counters to speed up autotuning convergence on GPUs

Cited by: 9
Authors
Filipovic, Jiri [1 ]
Hozzova, Jana [1 ]
Nezarat, Amin [1 ]
Ol'ha, Jaroslav [1 ]
Petrovic, Filip [1 ]
Affiliation
[1] Masaryk Univ, Inst Comp Sci, Bot 68a, Brno 60200, Czech Republic
Keywords
Auto-tuning; Search method; Performance counters; CUDA; Parallelism
DOI
10.1016/j.jpdc.2021.10.003
CLC number
TP301 [Theory and Methods]
Discipline code
081202
Abstract
Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimizing code for a particular type of hardware and specific data characteristics can be extremely challenging. The autotuning of performance-relevant source-code parameters allows for automatic optimization of applications and keeps their performance portable. Although the autotuning process typically results in code speed-up, searching the tuning space can bring unacceptable overhead if (i) the tuning space is vast and full of poorly performing implementations, or (ii) the autotuning process has to be repeated frequently because of changes in processed data or migration to different hardware. In this paper, we introduce a novel method for searching generic tuning spaces. The tuning spaces can contain tuning parameters changing any user-defined property of the source code. The method takes advantage of collecting hardware performance counters (also known as profiling counters) during empirical tuning. Those counters are used to navigate the search process towards faster implementations. The method requires the tuning space to be sampled on any GPU. It builds a problem-specific model, which can be used during autotuning on various, even previously unseen, inputs or GPUs. Using a set of five benchmarks, we experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics. We also compare our method to the state of the art and show that our method is superior in terms of the number of search steps and typically outperforms other searches in terms of convergence time. (c) 2021 Elsevier Inc. All rights reserved.
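The abstract describes a search that is steered by hardware performance counters rather than by measured runtimes alone. The following is a minimal, hypothetical sketch of that general idea only — it is not the authors' implementation, and the tuning space, the synthetic kernel model, and the "occupancy" counter below are all invented for illustration. Real counter collection would use a profiling API such as NVIDIA CUPTI; here a toy analytic model stands in for an actual GPU kernel.

```python
import random

# Hypothetical tuning space: (thread block size, loop unroll factor).
SPACE = [(bs, unroll) for bs in (32, 64, 128, 256) for unroll in (1, 2, 4, 8)]

def run_kernel(block_size, unroll):
    """Synthetic stand-in for an empirical measurement: returns a runtime
    and a dict of 'performance counters' for one configuration."""
    occupancy = min(1.0, block_size / 256)          # invented counter
    runtime = 10.0 / (occupancy + 0.1) + abs(unroll - 4) * 0.5
    return runtime, {"occupancy": occupancy}

def counter_guided_search(budget=6):
    """Greedy search that, after each measurement, re-ranks the untried
    configurations by a counter-based score (a crude stand-in for the
    paper's learned problem-specific model)."""
    tried = {}
    best = None
    candidates = list(SPACE)
    random.shuffle(candidates)
    for _ in range(budget):
        cfg = candidates.pop()                      # take most promising
        runtime, _counters = run_kernel(*cfg)
        tried[cfg] = runtime
        if best is None or runtime < tried[best]:
            best = cfg
        # Navigate: prefer untried configs whose predicted counter value
        # (here, occupancy) looks best; sort ascending, pop() takes the top.
        candidates.sort(key=lambda c: min(1.0, c[0] / 256))
    return best, tried[best]

best_cfg, best_runtime = counter_guided_search()
print(best_cfg, round(best_runtime, 2))
```

The key design point mirrored from the abstract is that the counters, not just the runtimes, decide which configuration is measured next, so far fewer empirical tuning steps are wasted on poorly performing implementations.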
Pages: 16-35 (20 pages)