Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems

被引:29
作者
Agullo, Emmanuel [1 ,5 ]
Buttari, Alfredo [2 ,6 ]
Guermouche, Abdou [3 ,7 ]
Lopez, Florent [4 ,6 ,8 ]
机构
[1] Inria LaBRI, Talence, France
[2] CNRS IRIT, Toulouse, France
[3] Univ Bordeaux, Talence, France
[4] UPS IRIT, Toulouse, France
[5] Ctr Rech Inria Bordeaux Sud Ouest, 200 Ave Vieille Tour, F-33405 Talence, France
[6] ENSEEIHT IRIT, 2 Rue Camichel, F-31071 Toulouse, France
[7] LaBRI, 351 Cours Liberat, F-33405 Talence, France
[8] STFC Rutherford Appleton Lab, Harwell Campus, Didcot OX11 0QX, Oxon, England
来源
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE | 2016年 / 43卷 / 02期
关键词
Algorithms; Performance; Sparse direct solvers; multicores; runtime systems; communication-avoiding; memory-aware; QR FACTORIZATION;
D O I
10.1145/2898348
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This article evaluates the usability and effectiveness of runtime systems based on the Sequential Task Flow model for complex applications, namely, sparse matrix multifrontal factorizations that feature extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Most importantly, it shows how this parallel programming model eases the development of complex features that benefit the performance of sparse, direct solvers as well as their memory consumption. We illustrate our discussion with the multifrontal QR factorization running on top of the StarPU runtime system.
引用
收藏
页数:22
相关论文
共 39 条
[31]  
Lacoste X., 2014, RR8446 INRIA
[32]   A COMPACT ROW STORAGE SCHEME FOR CHOLESKY FACTORS USING ELIMINATION TREES [J].
LIU, JWH .
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 1986, 12 (02) :127-148
[33]  
Pellegrini F., 1996, High-Performance Computing and Networking. International Conference and Exhibition HPCN EUROPE 1996. Proceedings, P493, DOI 10.1007/3-540-61142-8_588
[34]   Programming Matrix Algorithms-by-Blocks for Thread-Level Parallelism [J].
Quintana-Orti, Gregorio ;
Quintana-Orti, Enrique S. ;
Van de Geijn, Robert A. ;
Van Zee, Field G. ;
Chan, Ernie .
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2009, 36 (03)
[35]  
Rouet F.-H., 2012, THESIS
[36]   A NEW IMPLEMENTATION OF SPARSE GAUSSIAN-ELIMINATION [J].
SCHREIBER, R .
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 1982, 8 (03) :256-276
[37]  
The OpenMP architecture review board, 2013, OPENMP 4 0 COMPL SPE
[38]   Performance-effective and low-complexity task scheduling for heterogeneous computing [J].
Topcuoglu, H ;
Hariri, S ;
Wu, MY .
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2002, 13 (03) :260-274
[39]  
Yeralan Sencer, 2015, ACM T MATH UNPUB JAN