XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures

Cited by: 100
Authors
Gautier, Thierry [1 ]
Lima, Joao V. F. [2 ,3 ]
Maillard, Nicolas [3 ]
Raffin, Bruno [1 ]
Affiliations
[1] INRIA, Grenoble, France
[2] Univ Grenoble, F-38041 Grenoble, France
[3] Univ Fed Rio Grande do Sul, BR-90046900 Porto Alegre, RS, Brazil
Source
IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013) | 2013
Keywords
High Performance Computing; Data-Flow Task Model; Heterogeneous Architectures; Locality-Aware Work Stealing; Dense Linear Algebra; Multi-GPU; Platforms
DOI
10.1109/IPDPS.2013.66
CLC Number
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators, such as GPUs. Programming such nodes is typically based on a combination of OpenMP and CUDA/OpenCL code; scheduling relies on a static partitioning and a cost model. We present the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work stealing scheduler. XKaapi enables task multi-implementation on CPU or GPU and multi-level parallelism with different grain sizes. We show performance results on two dense linear algebra kernels, matrix product (GEMM) and Cholesky factorization (POTRF), to evaluate XKaapi on a heterogeneous architecture composed of two hexa-core CPUs and eight NVIDIA Fermi GPUs. Our conclusion is twofold. First, fine-grained parallelism and online scheduling achieve performance results as good as static strategies, and in most cases outperform them. This is due to an improved work stealing strategy that includes locality information, a very light implementation of tasks in XKaapi, and an optimized search for ready tasks. Second, the multi-level parallelism on multiple CPUs and GPUs enabled by XKaapi led to a highly efficient Cholesky factorization. Using eight NVIDIA Fermi GPUs and four CPUs, we measure up to 2.43 TFlop/s on double precision matrix product and 1.79 TFlop/s on Cholesky factorization, and respectively 5.09 TFlop/s and 3.92 TFlop/s in single precision.
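The core idea of the data-flow task model described in the abstract is that dependencies are not declared explicitly: the runtime infers them from the read/write access modes each task declares on shared data. The sketch below illustrates that inference in Python; it is a hypothetical toy, not XKaapi's actual API. The `DataFlowScheduler` class, the `spawn`/`run` names, and the sequential ready-task loop (standing in for the parallel work-stealing scheduler) are all assumptions made for illustration, and only true (read-after-write) dependencies are tracked.

```python
# Toy illustration of a data-flow task model: dependencies are inferred
# from declared read/write accesses, not stated by the programmer.
# This is NOT XKaapi's API -- just a minimal sketch of the idea.

class DataFlowScheduler:
    def __init__(self):
        self.tasks = []        # list of callables
        self.deps = []         # deps[i] = set of task ids task i waits on
        self.last_writer = {}  # data name -> id of the last task writing it

    def spawn(self, fn, reads=(), writes=()):
        """Register a task; infer read-after-write dependencies.

        Simplification: write-after-read (anti) dependencies are ignored,
        which is safe here only because run() executes tasks one at a time.
        """
        tid = len(self.tasks)
        waiting_on = set()
        for name in list(reads) + list(writes):
            if name in self.last_writer:      # depend on the last writer
                waiting_on.add(self.last_writer[name])
        for name in writes:
            self.last_writer[name] = tid
        self.tasks.append(fn)
        self.deps.append(waiting_on)
        return tid

    def run(self):
        """Repeatedly execute any 'ready' task (all dependencies done)."""
        done = set()
        order = []
        while len(done) < len(self.tasks):
            for tid, waiting_on in enumerate(self.deps):
                if tid not in done and waiting_on <= done:
                    self.tasks[tid]()
                    done.add(tid)
                    order.append(tid)
                    break
        return order


if __name__ == "__main__":
    sched = DataFlowScheduler()
    log = []
    sched.spawn(lambda: log.append("write A"), writes=["A"])
    sched.spawn(lambda: log.append("read A (1)"), reads=["A"])
    sched.spawn(lambda: log.append("read A (2)"), reads=["A"])
    order = sched.run()
    print(order)  # the writer always runs before both readers
```

In a runtime like the one the paper describes, the ready-task search and execution would be performed concurrently by worker threads using locality-aware work stealing, and each task could carry both a CPU and a GPU implementation; this sketch keeps only the dependency-inference step.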
Pages: 1299-1308 (10 pages)