TileSpTRSV: a tiled algorithm for parallel sparse triangular solve on GPUs

被引：0

作者：

Zhengyang Lu

Weifeng Liu

机构：

[1] China University of Petroleum-Beijing,Super Scientific Software Laboratory

来源：

CCF Transactions on High Performance Computing | 2023年 / 5卷

关键词：

Sparse matrix; Sparse triangular solve; Tiled algorithm; GPU;

D O I：

暂无

中图分类号：

学科分类号：

摘要：

Sparse triangular solve (SpTRSV) is one of the most important level-2 kernels in sparse basic linear algebra subprograms (BLAS). Compared to another level-2 sparse BLAS kernel sparse matrix–vector multiplication (SpMV), SpTRSV is in general more difficult to find high parallelism on many-core processors, such as GPUs. Nowadays, much work focuses on reducing dependencies and synchronizations in the level-set and Sync-free algorithms for SpTRSV. However, there is less work that can make good use of sparse spatial structure for SpTRSV on GPUs. In this paper, we propose a tiled algorithm called TileSpTRSV for optimizing SpTRSV on GPUs through exploiting 2D spatial structure of sparse matrices. We design two algorithm implementations, i.e., TileSpTRSV_level-set and TileSpTRSV_sync-free, for TileSpTRSV on top of level-set and Sync-free algorithms, respectively. By testing 16 representative matrices on a latest NVIDIA GPU, the experimental results show that TileSpTRSV_level-set gives on average 5.29×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} (up to 38.10×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document}), 5.33×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} (up to 21.32×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document}) and 2.62×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} (up to 12.87×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document}) speedups over cuSPARSE, Sync-free and Recblock algorithms on the 16 representative matrices, respectively.

引用

页码：129 / 143

页数：14

共 37 条

[1]

Ahmad N(2021)A split execution model for sptrsv IEEE Trans. Parallel Distrib. Syst. 32 2809-2822

[2]

Yilmaz B(1989)Solving sparse triangular linear systems on parallel computers Int. J. High Speed Comput. 1 73-95

[3]

Unat D(2016)Domain overlap for iterative sparse triangular solves on GPUs Softw. Exascale Comput. SPPEXA 2013–2015 527-545

[4]

Anderson E(2018)ParILUT—a new parallel threshold ILU factorization SIAM J. Sci. Comput. 40 C503-C519

[5]

Saad Y(2018)Incomplete sparse approximate inverses for parallel preconditioning Parallel Comput. 71 1-22

[6]

Anzt H(2007)Performance optimization and modeling of blocked sparse kernels Int. J. High Perform. Comput. Appl. 21 467-484

[7]

Chow E(2011)The University of Florida sparse matrix collection ACM Trans. Math. Softw. 38 11-125

[8]

Szyld DB(2005)An overview of SuperLU: algorithms, implementation, and user interface ACM Trans. Math. Softw. 31 302-325

[9]

Anzt H(2013)GPU-accelerated preconditioned iterative linear solvers J. Supercomput. 63 443-466

[10]

Chow E(2015)A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors J. Parallel Distrib. Comput. 85 47-61

← 1 2 3 4 →