Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

Cited by: 12
Authors
Rubensson, Emanuel H. [1 ]
Rudberg, Elias [1 ]
Affiliations
[1] Uppsala Univ, Div Comp Sci, Dept Informat Technol, Box 337, SE-75105 Uppsala, Sweden
Funding
Swedish Research Council
Keywords
Parallel computing; Sparse matrix-matrix multiplication; Scalable algorithms; Large-scale computing; Graphics processing units; DENSITY-MATRIX; IMPLEMENTATION; PERFORMANCE; DESIGN; SYSTEM; COSTS
DOI
10.1016/j.parco.2016.06.005
CLC number
TP301 [Theory, Methods]
Discipline code
081202
Abstract
We present a method for parallel block-sparse matrix-matrix multiplication on distributed memory clusters. By using a quadtree matrix representation, data locality is exploited without prior information about the matrix sparsity pattern. A distributed quadtree matrix representation is straightforward to implement thanks to our recent development of the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined with the Chunks and Tasks model leads to favorable weak and strong scaling of the communication cost with the number of processes, as shown both theoretically and in numerical experiments. Matrices are represented by sparse quadtrees of chunk objects, where the leaves in the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by the matrix library and may occur at any level in the hierarchy and/or within the submatrix leaves. If graphics processing units (GPUs) are available, both CPUs and GPUs are used for leaf-level multiplication work, making use of the full computing capacity of each node. The performance is evaluated for matrices with different sparsity structures, including examples from electronic structure calculations. Compared to methods that do not exploit data locality, our locality-aware approach reduces communication significantly, achieving essentially constant communication per node in weak scaling tests. (C) 2016 Elsevier B.V. All rights reserved.
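To make the recursive quadtree idea concrete, below is a minimal single-node C++ sketch of a quadtree matrix with recursive multiplication. It is illustrative only and does not use the Chunks and Tasks API: the names (QuadTreeNode, LeafBlock, multiply, addInto) are hypothetical, leaves are stored dense rather than block-sparse for brevity, and dimensions are assumed square and power-of-two. Null children encode all-zero quadrants, which is how sparsity at any level of the hierarchy is skipped during multiplication; in the distributed setting described in the abstract, each submatrix would instead be a chunk and each recursive product a task.

```cpp
// Minimal sketch, NOT the paper's implementation: quadtree matrix with
// recursive multiplication, skipping all-zero quadrants (null children).
#include <array>
#include <memory>
#include <vector>

// A leaf holds a small submatrix; dense here for brevity (the paper uses
// block-sparse leaves).
struct LeafBlock {
    int n;                                  // leaf dimension (n x n)
    std::vector<double> a;                  // row-major entries
    explicit LeafBlock(int n_) : n(n_), a(static_cast<size_t>(n_) * n_, 0.0) {}
};

// A node is either a leaf or four quadrants; a null quadrant encodes an
// all-zero submatrix, so sparsity may occur at any level of the hierarchy.
struct QuadTreeNode {
    std::unique_ptr<LeafBlock> leaf;                     // set iff this is a leaf
    std::array<std::unique_ptr<QuadTreeNode>, 4> child;  // (0,0) (0,1) (1,0) (1,1)
    bool isLeaf() const { return leaf != nullptr; }
};

// Leaf-level multiply-accumulate C += A * B; in the paper this is where the
// CPU and GPU block-sparse kernels do the actual floating-point work.
static void leafMultiplyAdd(const LeafBlock& A, const LeafBlock& B, LeafBlock& C) {
    const int n = A.n;
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            const double aik = A.a[i * n + k];
            if (aik == 0.0) continue;                    // skip explicit zeros
            for (int j = 0; j < n; ++j)
                C.a[i * n + j] += aik * B.a[k * n + j];
        }
}

// Accumulate S += P, taking ownership of P (hypothetical helper).
static void addInto(QuadTreeNode& S, std::unique_ptr<QuadTreeNode> P, int n) {
    if (S.isLeaf()) {
        for (int i = 0; i < n * n; ++i) S.leaf->a[i] += P->leaf->a[i];
        return;
    }
    for (int q = 0; q < 4; ++q) {
        if (!P->child[q]) continue;                      // zero quadrant: nothing to add
        if (!S.child[q]) S.child[q] = std::move(P->child[q]);
        else addInto(*S.child[q], std::move(P->child[q]), n / 2);
    }
}

// Recursive C = A * B. A null operand means a zero matrix, so whole subtrees
// of work vanish wherever either factor is sparse; in the distributed library
// each recursive product would be registered as a task on chunk operands.
static std::unique_ptr<QuadTreeNode>
multiply(const QuadTreeNode* A, const QuadTreeNode* B, int n) {
    if (!A || !B) return nullptr;                        // zero factor => zero product
    auto C = std::make_unique<QuadTreeNode>();
    if (A->isLeaf()) {                                   // well-formed trees: B is a leaf too
        C->leaf = std::make_unique<LeafBlock>(n);
        leafMultiplyAdd(*A->leaf, *B->leaf, *C->leaf);
        return C;
    }
    // C(i,j) = sum over k of A(i,k) * B(k,j) on the 2x2 block structure.
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k) {
                auto P = multiply(A->child[2 * i + k].get(),
                                  B->child[2 * k + j].get(), n / 2);
                if (!P) continue;
                if (!C->child[2 * i + j]) C->child[2 * i + j] = std::move(P);
                else addInto(*C->child[2 * i + j], std::move(P), n / 2);
            }
    for (const auto& q : C->child)
        if (q) return C;                                 // keep node if any quadrant is nonzero
    return nullptr;                                      // all quadrants zero => zero result
}
```

Because the recursion only descends into quadrant pairs where both factors are nonzero, the work and data movement adapt to the sparsity pattern without it being known in advance, which is the property the abstract attributes to the quadtree representation; the locality-aware distribution of these recursive tasks and their chunk operands is what the paper credits for the essentially constant per-node communication in weak scaling.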
Pages: 87-106
Number of pages: 20