CUDA-enabled Sparse Matrix-Vector Multiplication on GPUs using atomic operations

被引:38
作者
Dang, Hoang-Vu [1 ]
Schmidt, Bertil [1 ]
机构
[1] Johannes Gutenberg Univ Mainz, Inst Informat, D-55128 Mainz, Germany
关键词
GPU computing; CUDA; Sparse matrices; Sparse Matrix-Vector Multiplication; Scientific programming;
D O I
10.1016/j.parco.2013.09.005
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Existing formats for Sparse Matrix-Vector Multiplication (SpMV) on the GPU are outperforming their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations. We compare SCOO performance to existing formats of the NVIDIA Cusp library using large sparse matrices. Our results for single-precision floating-point matrices show that SCOO outperforms the COO and CSR format for all tested matrices and the HYB format for all tested unstructured matrices on a single GPU. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Due to the lower performance of existing CUDA-enabled GPUs for atomic operations on double-precision floating-point numbers the SCOO implementation for double-precision does not consistently outperform the other formats for every unstructured matrix. Overall, the average speedup of SCOO for the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, 2.39 (1.37) compared to HYB for single (double) precision on a Tesla C2075. Furthermore, comparison to a Sandy-Bridge CPU shows that SCOO on a Fermi GPU outperforms the multi-threaded CSR implementation of the Intel MKL Library on an i7-2700 K by a factor between 5.5 (2.3) and 18 (12.7) for single (double) precision. (C) 2013 Elsevier B.V. All rights reserved.
引用
收藏
页码:737 / 750
页数:14
相关论文
共 18 条
  • [1] Baskaran M.M., 2008, OPTIMIZING SPARSE MA
  • [2] Bell N, 2009, STUDENTS GUIDE TO THE MA TESOL, P1
  • [3] Bell N., 2012, CUSP: Generic parallel algorithms for sparse matrix and graph computations
  • [4] Concurrent number cruncher: a GPU implementation of a general sparse linear solver
    Buatois, Luc
    Caumon, Guillaume
    Levy, Bruno
    [J]. INTERNATIONAL JOURNAL OF PARALLEL EMERGENT AND DISTRIBUTED SYSTEMS, 2009, 24 (03) : 205 - 223
  • [5] Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs
    Choi, Jee W.
    Singh, Amik
    Vuduc, Richard W.
    [J]. ACM SIGPLAN NOTICES, 2010, 45 (05) : 115 - 125
  • [6] The Sliced COO format for Sparse Matrix-Vector Multiplication on CUDA-enabled GPUs
    Dang, Hoang-Vu
    Schmidt, Bertil
    [J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE, ICCS 2012, 2012, 9 : 57 - 66
  • [7] SPARSE-MATRIX TEST PROBLEMS
    DUFF, IS
    GRIMES, RG
    LEWIS, JG
    [J]. ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 1989, 15 (01): : 1 - 14
  • [8] Kohl J., 2008, MATVIEW SCALABLE SPA
  • [9] Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation
    Kreutzer, Moritz
    Hager, Georg
    Wellein, Gerhard
    Fehske, Holger
    Basermann, Achim
    Bishop, Alan R.
    [J]. 2012 IEEE 26TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS & PHD FORUM (IPDPSW), 2012, : 1696 - 1702
  • [10] Matam K. K., 2011, 2011 International Conference on Parallel Processing, P612, DOI 10.1109/ICPP.2011.82