TileSpGEMM: A Tiled Algorithm for Parallel Sparse General Matrix-Matrix Multiplication on GPUs

Cited by: 38
Authors
Niu, Yuyao [1 ]
Lu, Zhengyang [1 ]
Ji, Haonan [1 ]
Song, Shuhui [1 ]
Jin, Zhou [1 ]
Liu, Weifeng [1 ]
Affiliations
[1] China Univ Petr, Super Sci Software Lab, Beijing, Peoples R China
Source
PPOPP'22: PROCEEDINGS OF THE 27TH ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING | 2022
Funding
National Natural Science Foundation of China;
Keywords
Sparse matrix; SpGEMM; tiled algorithm; GPU; LINEAR ALGEBRA SUBPROGRAMS; MANY-CORE; VECTOR MULTIPLICATION; IMPLEMENTATION; LIBRARY; DESIGN; MODEL;
DOI
10.1145/3503221.3508431
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline codes
081202; 0835;
Abstract
Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental building blocks in sparse linear solvers, graph processing frameworks and machine learning applications. Existing parallel approaches for shared-memory SpGEMM mostly use the row-row style, which offers good parallelism. However, because of the irregularity of sparsity structures, existing row-row methods often suffer from three problems: (1) load imbalance, (2) high global space complexity and unsatisfactory data locality, and (3) difficulty in selecting a sparse accumulator. In this paper we propose a tiled parallel SpGEMM algorithm named TileSpGEMM. Our algorithm sparsifies the tiled method used in dense general matrix-matrix multiplication (GEMM) and stores each non-empty tile in a sparse form. Its first advantage is that the basic working unit is now a fixed-size sparse tile containing a small number of nonzeros, rather than a possibly very long row, so the load-imbalance issue is naturally alleviated. Secondly, the temporary space needed for each tile is small and always fits in on-chip scratchpad memory, so there is no need to allocate off-chip space for a large number of intermediate products, and data locality is much better. Thirdly, because the computations are restricted to a single tile, it is relatively easy to select a fast sparse accumulator for each sparse tile. Our experimental results on two of the latest NVIDIA GPUs show that TileSpGEMM outperforms four state-of-the-art SpGEMM methods, cuSPARSE, bhSPARSE, NSPARSE and spECK, on 139, 138, 127 and 94, respectively, of the 142 square matrices whose SpGEMM performs no less than one billion flops, and delivers up to 2.78x, 145.35x, 97.86x and 3.70x speedups, respectively.
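To make the tiled idea in the abstract concrete, the following is a minimal, hedged Python sketch: both inputs are partitioned into fixed-size tiles, only non-empty tiles are kept (each stored sparsely), and every output tile is accumulated in a small per-tile buffer, mirroring the paper's on-chip scratchpad accumulation. The tile size, dict-based data layout and all function names here are illustrative assumptions for exposition, not the paper's actual GPU implementation or API.

```python
T = 2  # illustrative tile size; the paper uses GPU-friendly tile sizes

def to_tiles(coo):
    """Group nonzeros {(i, j): v} into sparse tiles keyed by (ti, tj).

    Only non-empty tiles are materialized, and each tile holds its
    nonzeros in a sparse form (a dict of local coordinates).
    """
    tiles = {}
    for (i, j), v in coo.items():
        tiles.setdefault((i // T, j // T), {})[(i % T, j % T)] = v
    return tiles

def tile_spgemm(a_coo, b_coo):
    """C = A * B, multiplying only matching pairs of non-empty tiles."""
    A, B = to_tiles(a_coo), to_tiles(b_coo)
    c = {}
    for (ti, tk), a_tile in A.items():
        for (tk2, tj), b_tile in B.items():
            if tk != tk2:          # tiles must share an inner index
                continue
            # Accumulate the small tile-tile product directly; on a
            # GPU this per-tile accumulator fits in scratchpad memory.
            for (i, k), av in a_tile.items():
                for (k2, j), bv in b_tile.items():
                    if k == k2:
                        gi, gj = ti * T + i, tj * T + j
                        c[(gi, gj)] = c.get((gi, gj), 0.0) + av * bv
    return c
```

Because each working unit is a bounded tile-pair rather than a whole row, work per unit is capped, which is the load-balancing property the abstract describes.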
Pages: 90-106
Page count: 17