Load-balancing Sparse Matrix Vector Product Kernels on GPUs

被引:31
作者
Anzt, Hartwig [1 ,2 ]
Cojean, Terry [1 ]
Chen, Yen-Chen [3 ,7 ]
Dongarra, Jack [2 ,4 ,5 ]
Flegar, Goran [6 ]
Nayak, Pratik [1 ]
Tomov, Stanimire [2 ]
Tsai, Yuhsiang M. [3 ]
Wang, Weichung [3 ]
机构
[1] Karlsruhe Inst Technol, Karlsruhe, Germany
[2] Univ Tennessee, Knoxville, TN 37996 USA
[3] Natl Taiwan Univ, Taipei, Taiwan
[4] Oak Ridge Natl Lab, Oak Ridge, TN USA
[5] Univ Manchester, Manchester, Lancs, England
[6] Univ Jaume 1, Castellon De La Plana, Spain
[7] Univ Tokyo, Tokyo, Japan
关键词
Sparse Matrix Vector Product (SpMV); irregular matrices; GPUs; MULTIPLICATION;
D O I
10.1145/3380930
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Efficient processing of Irregular Matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for Irregular Matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high performance sparse matrix-vector product (SpMV) kernels designed for Nvidia Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in SIMD-friendly Ellpack format (ELL) format. The ratio between the ELL- and the COO-part is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the Suite Sparse matrix collection, we compare the performance against SpMV kernels provided by NVIDIA's cuSPARSE library and a heavily-tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format.
引用
收藏
页数:26
相关论文
共 29 条
[11]   An Online Holistic Scheduling Framework for Energy-Constrained Wireless Real-Time Systems [J].
Chantem, Thidapat ;
Yi, Jun ;
Hong, Shengyan ;
Hu, X. Sharon ;
Poellabauer, Christian ;
Zhang, Liqiang .
2011 IEEE 17TH INTERNATIONAL CONFERENCE ON EMBEDDED AND REAL-TIME COMPUTING SYSTEMS AND APPLICATIONS (RTCSA 2011), VOL 1, 2011, :267-276
[12]  
Chow E., 2016, 291 LAPACK
[13]   Optimizing Sparse Matrix Operations on GPUs using Merge Path [J].
Dalton, Steven ;
Olson, Luke ;
Baxter, Sean ;
Merrill, Duane ;
Garland, Michael .
2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2015, :407-416
[14]   Estimating an eigenvector by the power method with a random start [J].
DelCorso, GM .
SIAM JOURNAL ON MATRIX ANALYSIS AND APPLICATIONS, 1997, 18 (04) :913-937
[15]   Sparse Matrix-Vector Multiplication on GPGPUs [J].
Filippone, Salvatore ;
Cardellini, Valeria ;
Barbieri, Davide ;
Fanfarillo, Alessandro .
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2017, 43 (04)
[16]   Overcoming Load Imbalance for Irregular Sparse Matrices [J].
Flegar, Goran ;
Anzt, Hartwig .
PROCEEDINGS OF IA3 2017: SEVENTH WORKSHOP ON IRREGULAR APPLICATIONS: ARCHITECTURES AND ALGORITHMS, 2017,
[17]   Balanced CSR Sparse Matrix-Vector Product on Graphics Processors [J].
Flegar, Goran ;
Quintana-Orti, Enrique S. .
EURO-PAR 2017: PARALLEL PROCESSING, 2017, 10417 :697-709
[18]   A Note on Performance Profiles for Benchmarking Software [J].
Gould, Nicholas ;
Scott, Jennifer .
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2016, 43 (02)
[19]  
Grossman Max, 2016, ABS160800636 CORR
[20]  
Higham D. J., 2005, MATLAB GUIDE, V150, DOI [10.1137/1.9780898717891, DOI 10.1137/1.9780898717891]