Hybrid static/dynamic scheduling for already optimized dense matrix factorization

被引:9
作者
Donfack, Simplice [1 ]
Grigori, Laura [1 ]
Gropp, William D. [2 ]
Kale, Vivek [2 ]
机构
[1] Univ Paris 11, INRIA Saclay Ile France, Bat 425, F-91405 Orsay, France
[2] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
来源
2012 IEEE 26TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS) | 2012年
关键词
dynamic scheduling; communication-avoiding; LU factorization; numerical linear algebra; LOCALITY;
D O I
10.1109/IPDPS.2012.53
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
We present the use of a hybrid static/dynamic scheduling strategy of the task dependency graph for direct methods used in dense numerical linear algebra. This strategy provides a balance of data locality, load balance, and low dequeue overhead. We show that the usage of this scheduling in communication avoiding dense factorization leads to significant performance gains. On a 48 core AMD Opteron NUMA machine, our experiments show that we can achieve up to 64% improvement over a version of CALU that uses fully dynamic scheduling, and up to 30% improvement over the version of CALU that uses fully static scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic scheduling approach is up to 8% faster than the version of CALU that uses a fully static scheduling or fully dynamic scheduling. Our algorithm leads to speedups over the corresponding routines for computing LU factorization in well known libraries. On the 48 core AMD NUMA machine, our best implementation is up to 110% faster than MKL, while on the 16 core Intel Xeon machine, it is up to 82% faster than MKL. Our approach also shows significant speedups compared with PLASMA on both of these systems.
引用
收藏
页码:496 / 507
页数:12
相关论文
共 17 条
  • [1] Scheduling multithreaded computations by work stealing
    Blumofe, RD
    Leiserson, CE
    [J]. JOURNAL OF THE ACM, 1999, 46 (05) : 720 - 748
  • [2] BLUMOFE RD, 1995, SIGPLAN NOTICES, V30, P207
  • [3] A class of parallel tiled linear algebra algorithms for multicore architectures
    Buttari, Alfredo
    Langou, Julien
    Kurzak, Jakub
    Dongarra, Jack
    [J]. PARALLEL COMPUTING, 2009, 35 (01) : 38 - 53
  • [4] Donfack S., 2010, PAR DISTR PROC IPDPS, P1, DOI DOI 10.1109/IPDPS.2010.5470348
  • [5] Frigo M., 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039), P285, DOI 10.1109/SFFCS.1999.814600
  • [6] Ganapathi A., 2009, Proceedings of the First USENIX Conference on Hot Topics in Parallelism, P1
  • [7] Grigori L., 2008, P 2008 ACM IEEE C SU, P29
  • [8] Recursion leads to automatic variable blocking for dense linear-algebra algorithms
    Gustavson, FG
    [J]. IBM JOURNAL OF RESEARCH AND DEVELOPMENT, 1997, 41 (06) : 737 - 755
  • [9] Hoefler T., 2010, INT C HIGH PERF COMP
  • [10] Kale V, 2010, LECT NOTES COMPUT SC, V6305, P229, DOI 10.1007/978-3-642-15646-5_24