Load-balancing Sparse Matrix Vector Product Kernels on GPUs

被引:24
作者
Anzt, Hartwig [1 ,2 ]
Cojean, Terry [1 ]
Chen, Yen-Chen [3 ,7 ]
Dongarra, Jack [2 ,4 ,5 ]
Flegar, Goran [6 ]
Nayak, Pratik [1 ]
Tomov, Stanimire [2 ]
Tsai, Yuhsiang M. [3 ]
Wang, Weichung [3 ]
机构
[1] Karlsruhe Inst Technol, Karlsruhe, Germany
[2] Univ Tennessee, Knoxville, TN 37996 USA
[3] Natl Taiwan Univ, Taipei, Taiwan
[4] Oak Ridge Natl Lab, Oak Ridge, TN USA
[5] Univ Manchester, Manchester, Lancs, England
[6] Univ Jaume 1, Castellon De La Plana, Spain
[7] Univ Tokyo, Tokyo, Japan
关键词
Sparse Matrix Vector Product (SpMV); irregular matrices; GPUs; MULTIPLICATION;
D O I
10.1145/3380930
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Efficient processing of Irregular Matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for Irregular Matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high performance sparse matrix-vector product (SpMV) kernels designed for Nvidia Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in SIMD-friendly Ellpack format (ELL) format. The ratio between the ELL- and the COO-part is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the Suite Sparse matrix collection, we compare the performance against SpMV kernels provided by NVIDIA's cuSPARSE library and a heavily-tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format.
引用
收藏
页数:26
相关论文
共 29 条
[1]  
Angerson E., 1990, Proceedings of Supercomputing '90 (Cat. No.90CH2916-5), P2, DOI 10.1109/SUPERC.1990.129995
[2]  
[Anonymous], 2017, WHIT NVIDIA TESLA V1
[3]  
[Anonymous], 2012, Google's pagerank and beyond: The science of search engine rankings
[4]  
Anzt H., 2014, Tech. Rep. ut-eecs-14-727
[5]   Towards Continuous Benchmarking: An Automated Performance Evaluation Framework for High Performance Software [J].
Anzt, Hartwig ;
Chen, Yen-Chen ;
Cojean, Terry ;
Dongarra, Jack ;
Flegar, Goran ;
Nayak, Pratik ;
Quintana-Orti, Enrique S. ;
Tsai, Yuhsiang M. ;
Wang, Weichung .
PROCEEDINGS OF THE PLATFORM FOR ADVANCED SCIENTIFIC COMPUTING CONFERENCE (PASC '19), 2019,
[6]   Preconditioned Krylov solvers on GPUs [J].
Anzt, Hartwig ;
Gates, Mark ;
Dongarra, Jack ;
Kreutzer, Moritz ;
Wellein, Gerhard ;
Koehler, Martin .
PARALLEL COMPUTING, 2017, 68 :32-44
[7]  
Barrett R., 1994, TEMPLATES SOLUTION L, DOI DOI 10.1137/1.9781611971538
[8]  
Bell N, 2009, STUDENTS GUIDE TO THE MA TESOL, P1
[9]  
Brin S, 1998, TECHNICAL REPORT, V1998, P161, DOI DOI 10.1007/978-3-319-08789-4_10
[10]   A survey on feature selection methods [J].
Chandrashekar, Girish ;
Sahin, Ferat .
COMPUTERS & ELECTRICAL ENGINEERING, 2014, 40 (01) :16-28