Heterogeneous sparse matrix-vector multiplication via compressed sparse row format

Cited: 2
Authors
Lane, Phillip Allen [1 ]
Booth, Joshua Dennis [1 ]
Affiliations
[1] Univ Alabama, 301 Sparkman Dr NW, Huntsville, AL 35899 USA
Funding
National Science Foundation (US);
Keywords
Sparse linear algebra; Sparse matrix vector multiplication; GPU; Heterogeneous; OpenMP; CUDA; ALGORITHM;
DOI
10.1016/j.parco.2023.102997
CLC Classification
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
Sparse matrix-vector multiplication (SpMV) is one of the most important kernels in high-performance computing (HPC), yet SpMV often performs poorly on many devices and therefore requires special care to store and tune for a given device. Moreover, HPC systems are increasingly heterogeneous, containing multiple different compute units, e.g., many-core CPUs and GPUs. An emerging goal is therefore to produce heterogeneous formats and methods that allow critical kernels, e.g., SpMV, to execute on different devices with portable performance and minimal changes to format and method. This paper presents a heterogeneous format based on CSR, named CSR-k, that can be tuned quickly and outperforms the average performance of Intel MKL on Intel Xeon Platinum 8380 and AMD Epyc 7742 CPUs, while also outperforming NVIDIA's cuSPARSE and Sandia National Laboratories' KokkosKernels on NVIDIA A100 and V100 GPUs, for regular sparse matrices, i.e., sparse matrices whose number of nonzeros per row has a variance <= 10, such as those commonly generated from two- and three-dimensional finite difference and finite element problems. In particular, CSR-k achieves this by reordering and grouping rows into a hierarchical structure of super-rows and super-super-rows that is represented by just a few extra arrays of pointers. Due to this simplicity, a model can be tuned for a device and then used to select super-row and super-super-row sizes in constant time.
Pages: 13