Modeling GPU Dynamic Parallelism for self similar density workloads

Cited by: 1
Authors
Quezada, Felipe A. [1 ]
Navarro, Cristobal A. [1 ]
Romero, Miguel [2 ,3 ]
Aguilera, Cristhian [4 ]
Affiliations
[1] Univ Austral Chile, Inst Informat, Valdivia, Chile
[2] Univ Adolfo Ibanez, Fac Engn & Sci, Santiago, Chile
[3] CENIA, Santiago, Chile
[4] Univ San Sebastian, Fac Ingn Arquitectura & Diseno, Puerto Montt, Chile
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2023, Vol. 145
Keywords
GPU; Dynamic Parallelism; Subdivision; Heterogeneous workload; Kernel recursion overhead; Self similar density;
DOI
10.1016/j.future.2023.03.046
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline code
081202 ;
Abstract
Dynamic Parallelism (DP) is a GPU programming abstraction that can make parallel computation more efficient for problems that exhibit heterogeneous workloads. With DP, GPU threads can launch kernels with more threads, recursively, producing a subdivision effect where resources are focused on the regions that exhibit more parallel work. Doing an optimal subdivision process is not trivial, as the combination of different parameters plays a relevant role in the final performance of DP. Also, the current programming abstraction of DP relies on kernel recursion, which has performance overhead. This work presents a new subdivision cost model for problems that exhibit self similar density (SSD) workloads, useful for finding efficient subdivision schemes. Also, a new subdivision implementation free of recursion overhead is presented, named Adaptive Serial Kernels (ASK). Using the Mandelbrot set as a case study, the cost model shows that optimal performance is achieved when using {g ~ 32, r ~ 2, B ~ 32} for the initial subdivision, recurrent subdivision and stopping size, respectively. Experimental results agree with the theoretical parameters, confirming the usability of the cost model. In terms of performance, the ASK approach runs up to ~60% faster than DP in the Mandelbrot set, and up to 12x faster than a basic exhaustive implementation, whereas DP is up to 7.5x faster. In terms of energy efficiency, ASK is up to ~2x and ~20x more energy efficient than DP and the exhaustive approach, respectively. These results put the subdivision cost model and the ASK approach as useful tools for analyzing the potential improvement of subdivision-based approaches and for developing more efficient GPU-based libraries or fine-tuning specific codes in research teams. (c) 2023 Elsevier B.V. All rights reserved.
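The subdivision scheme described in the abstract can be illustrated with a minimal CPU sketch, assuming a simple interpretation of the three parameters: an initial g x g grid of tiles, recursive subdivision of non-uniform tiles by a factor r, and an exhaustive stop once a tile reaches B x B pixels. This is not the paper's GPU implementation; the border-sampling heuristic used to classify a tile as uniform is our own simplification for illustration.

```python
def escape_time(cx, cy, max_iter=64):
    """Standard Mandelbrot escape-time iteration for point (cx, cy)."""
    x = y = 0.0
    for i in range(max_iter):
        if x * x + y * y > 4.0:
            return i
        x, y = x * x - y * y + cx, 2 * x * y + cy
    return max_iter

def border_uniform(x0, y0, size, samples, max_iter=64):
    """Heuristic: if all sampled border points of a tile share one
    escape time, treat the tile as uniform (no subdivision needed)."""
    step = size / samples
    vals = set()
    for k in range(samples + 1):
        vals.add(escape_time(x0 + k * step, y0, max_iter))
        vals.add(escape_time(x0 + k * step, y0 + size, max_iter))
        vals.add(escape_time(x0, y0 + k * step, max_iter))
        vals.add(escape_time(x0 + size, y0 + k * step, max_iter))
        if len(vals) > 1:
            return False
    return True

def subdivide(x0, y0, size, n, r, B, stats):
    """Recursively subdivide an n x n pixel tile until it is uniform
    or reaches the stopping size B, then count it as exhaustive work."""
    if n <= B or border_uniform(x0, y0, size, min(n, 8)):
        stats["leaves"] += 1
        stats["leaf_pixels"] += n * n
        return
    stats["subdivisions"] += 1
    child = size / r
    for i in range(r):
        for j in range(r):
            subdivide(x0 + i * child, y0 + j * child, child, n // r, r, B, stats)

def run(g=4, r=2, B=32, n=256):
    """Cover [-2, 0.5] x [-1.25, 1.25] with an initial g x g tile grid."""
    stats = {"leaves": 0, "leaf_pixels": 0, "subdivisions": 0}
    size = 2.5 / g
    for i in range(g):
        for j in range(g):
            subdivide(-2.0 + i * size, -1.25 + j * size, size, n // g, r, B, stats)
    return stats

stats = run()
print(stats)
```

Since subdivision partitions the domain, the leaf tiles always cover the full n x n image; the potential saving comes from tiles classified as uniform early, which the GPU version would process with fewer threads or skip. The parameters g, r and B here play the same roles as in the cost model, though at toy scale.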
Pages: 239-253
Page count: 15
References
43 references in total
[1]   Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs [J].
Abdelfattah, Ahmad ;
Ltaief, Hatem ;
Keyes, David ;
Dongarra, Jack .
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2016, 28 (12) :3447-3465
[2]  
Adinetz A., 2014, ADAPTIVE PARALLEL CO
[3]   Using Dynamic Parallelism to Speed-Up Clustering-Based Community Detection in Social Networks [J].
Alandoli, Mohammed ;
Al-Ayyoub, Mahmoud ;
Al-Smadi, Mohammad ;
Jararweh, Yaser ;
Benkhelifa, Elhadj .
2016 IEEE 4TH INTERNATIONAL CONFERENCE ON FUTURE INTERNET OF THINGS AND CLOUD WORKSHOPS (FICLOUDW), 2016, :240-245
[4]  
[Anonymous], 2008, 2008 IEEE Hot Chips 20 Symposium (HCS), DOI 10.1109/HOTCHIPS.2008.7476516
[5]  
Bailey M.W., 2001, TR20012 HAM COLL
[6]  
Bédorf J, 2020, arXiv:1909.07439
[7]   A sparse octree gravitational N-body code that runs entirely on the GPU processor [J].
Bedorf, Jeroen ;
Gaburov, Evghenii ;
Zwart, Simon Portegies .
JOURNAL OF COMPUTATIONAL PHYSICS, 2012, 231 (07) :2825-2839
[8]  
Böhm C., 2020, arXiv
[9]   Utilizing dynamic parallelism in CUDA to accelerate a 3D red-black successive over relaxation wind-field solver [J].
Bozorgmehr, Behnam ;
Willemsen, Pete ;
Gibbs, Jeremy A. ;
Stoll, Rob ;
Kim, Jae-Jin ;
Pardyjak, Eric R. .
ENVIRONMENTAL MODELLING & SOFTWARE, 2021, 137
[10]   Accelerating Reduction and Scan Using Tensor Core Units [J].
Dakkak, Abdul ;
Li, Cheng ;
Gelado, Isaac ;
Xiong, Jinjun ;
Hwu, Wen-mei .
INTERNATIONAL CONFERENCE ON SUPERCOMPUTING (ICS 2019), 2019, :46-57