Sparsity: Optimization framework for sparse matrix kernels

Cited by: 181
Authors
Im, EJ [1]
Yelick, K
Vuduc, R
Affiliations
[1] Kookmin Univ, Sch Comp Sci, Seoul, South Korea
[2] Univ Calif Berkeley, Div Comp Sci, Berkeley, CA 94720 USA
Keywords
sparse matrix; performance tuning; memory hierarchy
DOI
10.1177/1094342004041296
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Sparse matrix-vector multiplication is an important computational kernel that performs poorly on most modern processors due to a low compute-to-memory ratio and irregular memory access patterns. Optimization is difficult because of the complexity of cache-based memory systems and because performance is highly dependent on the non-zero structure of the matrix. The SPARSITY system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. SPARSITY combines traditional techniques such as loop transformations with data structure transformations and optimization heuristics that are specific to sparse matrices. It provides a novel framework for selecting optimization parameters, such as block size, using a combination of performance models and search. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that register-level optimizations are effective for matrices arising in certain scientific simulations, in particular finite-element problems. Cache-level optimizations are important when the vector used in multiplication is larger than the cache size, especially for matrices in which the non-zero structure is random. For applications involving multiple vectors, reorganizing the computation to perform the entire set of multiplications as a single operation produces significant speedups. We describe the different optimizations and parameter selection techniques and evaluate them on several machines using over 40 matrices taken from a broad set of application domains. Our results demonstrate speedups of up to 4x for the single-vector case and up to 10x for the multiple-vector case.
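The two operations named in the abstract, and the idea of register-level blocking, can be illustrated with a minimal Python sketch. This is not SPARSITY's code or API: the function names, the compressed sparse row (CSR) layout, the fixed 2x2 block size, and the row-major block storage are all assumptions chosen for illustration, and a tuned kernel would be generated in a compiled language with the block size selected per matrix and machine.

```python
# Illustrative sketch only; all names here are hypothetical, not SPARSITY's API.

def csr_spmv(n_rows, row_ptr, col_idx, vals, x):
    """y = A @ x for A stored in compressed sparse row (CSR) form."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]  # indirect, irregular access to x
        y[i] = acc
    return y


def csr_spmm(n_rows, row_ptr, col_idx, vals, xs):
    """Y = A @ X, where xs[j] holds entry j of every dense vector.

    Each non-zero of A is read once and reused for all vectors,
    instead of once per vector as in repeated csr_spmv calls.
    """
    n_vecs = len(xs[0])
    ys = [[0.0] * n_vecs for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[k]
            x_row = xs[col_idx[k]]
            for v in range(n_vecs):
                ys[i][v] += a * x_row[v]
    return ys


def bcsr2x2_spmv(n_block_rows, brow_ptr, bcol_idx, bvals, x):
    """y = A @ x with A stored as dense 2x2 blocks (assumed block size).

    One column index is stored per block, and the two partial sums
    y0, y1 can stay in registers across the inner loop.
    """
    y = [0.0] * (2 * n_block_rows)
    for bi in range(n_block_rows):
        y0 = y1 = 0.0
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            j = 2 * bcol_idx[k]
            x0, x1 = x[j], x[j + 1]
            b = bvals[4 * k: 4 * k + 4]  # one block, stored row-major
            y0 += b[0] * x0 + b[1] * x1
            y1 += b[2] * x0 + b[3] * x1
        y[2 * bi], y[2 * bi + 1] = y0, y1
    return y
```

In the blocked layout, any zero inside a stored block is kept as an explicit zero, so a larger block trades fewer index loads for extra arithmetic on padding. That trade-off is one reason the block size is a tuning parameter rather than a fixed choice.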
Pages: 135-158
Page count: 24