TileSpTRSV: a tiled algorithm for parallel sparse triangular solve on GPUs

被引:0
作者
Zhengyang Lu
Weifeng Liu
机构
[1] China University of Petroleum-Beijing,Super Scientific Software Laboratory
来源
CCF Transactions on High Performance Computing | 2023年 / 5卷
关键词
Sparse matrix; Sparse triangular solve; Tiled algorithm; GPU;
D O I
暂无
中图分类号
学科分类号
摘要
Sparse triangular solve (SpTRSV) is one of the most important level-2 kernels in sparse basic linear algebra subprograms (BLAS). Compared to another level-2 sparse BLAS kernel sparse matrix–vector multiplication (SpMV), SpTRSV is in general more difficult to find high parallelism on many-core processors, such as GPUs. Nowadays, much work focuses on reducing dependencies and synchronizations in the level-set and Sync-free algorithms for SpTRSV. However, there is less work that can make good use of sparse spatial structure for SpTRSV on GPUs. In this paper, we propose a tiled algorithm called TileSpTRSV for optimizing SpTRSV on GPUs through exploiting 2D spatial structure of sparse matrices. We design two algorithm implementations, i.e., TileSpTRSV_level-set and TileSpTRSV_sync-free, for TileSpTRSV on top of level-set and Sync-free algorithms, respectively. By testing 16 representative matrices on a latest NVIDIA GPU, the experimental results show that TileSpTRSV_level-set gives on average 5.29×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} (up to 38.10×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document}), 5.33×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} (up to 21.32×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document}) and 2.62×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} (up to 12.87×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document}) speedups over cuSPARSE, Sync-free and Recblock algorithms on the 16 representative matrices, respectively.
引用
收藏
页码:129 / 143
页数:14
相关论文
共 37 条
[1]  
Ahmad N(2021)A split execution model for sptrsv IEEE Trans. Parallel Distrib. Syst. 32 2809-2822
[2]  
Yilmaz B(1989)Solving sparse triangular linear systems on parallel computers Int. J. High Speed Comput. 1 73-95
[3]  
Unat D(2016)Domain overlap for iterative sparse triangular solves on GPUs Softw. Exascale Comput. SPPEXA 2013–2015 527-545
[4]  
Anderson E(2018)ParILUT—a new parallel threshold ILU factorization SIAM J. Sci. Comput. 40 C503-C519
[5]  
Saad Y(2018)Incomplete sparse approximate inverses for parallel preconditioning Parallel Comput. 71 1-22
[6]  
Anzt H(2007)Performance optimization and modeling of blocked sparse kernels Int. J. High Perform. Comput. Appl. 21 467-484
[7]  
Chow E(2011)The University of Florida sparse matrix collection ACM Trans. Math. Softw. 38 11-125
[8]  
Szyld DB(2005)An overview of SuperLU: algorithms, implementation, and user interface ACM Trans. Math. Softw. 31 302-325
[9]  
Anzt H(2013)GPU-accelerated preconditioned iterative linear solvers J. Supercomput. 63 443-466
[10]  
Chow E(2015)A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors J. Parallel Distrib. Comput. 85 47-61