Flexible Linear Algebra Development and Scheduling with Cholesky Factorization

Cited by: 3

Authors:

Haidar, Azzam [1]

YarKhan, Asim [1]

Cao, Chongxiao [1]

Luszczek, Piotr [1]

Tomov, Stanimire [1]

Dongarra, Jack [1,2,3]

Affiliations:

[1] Univ Tennessee, Knoxville, TN 37996 USA

[2] Oak Ridge Natl Lab, Oak Ridge, TN USA

[3] Univ Manchester, Manchester M13 9PL, Lancs, England

Source:

2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS) | 2015

Funding:

Russian Science Foundation

Keywords:

Cholesky factorization; accelerator-based distributed memory computers; superscalar dataflow scheduling; heterogeneous HPC computing

DOI:

10.1109/HPCC-CSS-ICESS.2015.285

Chinese Library Classification (CLC) number:

TP31 [Computer software]

Discipline classification codes:

081202; 0835

Abstract:

Modern high performance computing environments are composed of networks of compute nodes that often contain a variety of heterogeneous compute resources, such as multicore CPUs and GPUs. One challenge faced by domain scientists is how to use all these distributed, heterogeneous resources efficiently. To use GPUs effectively, the workload must expose much more parallelism than a multicore CPU requires. Additionally, using distributed memory nodes effectively introduces another level of complexity, because the workload must be carefully partitioned over the nodes. In this work we use a lightweight runtime environment to handle many of the complexities of such distributed, heterogeneous systems. The runtime environment uses task-superscalar concepts to enable the developer to write serial code while providing parallel execution. The task-programming model allows the developer to write resource-specialized code, so that each resource gets an appropriately sized workload grain. Our task-programming abstraction enables the developer to write a single algorithm that will execute efficiently across the distributed heterogeneous machine. We demonstrate the effectiveness of our approach with performance results for dense linear algebra applications, specifically the Cholesky factorization.
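The task-superscalar idea in the abstract (serial-looking code whose tasks a runtime may execute in parallel) can be illustrated with a minimal, single-node sketch of a tiled Cholesky factorization. This is not the runtime or API used in the paper: the `Runtime` class, its `submit` and `run` methods, and the helper names below are hypothetical, written in pure Python for illustration. The sketch assumes a symmetric positive definite matrix whose size is a multiple of the tile size, infers dependencies from the tiles each task reads and writes, and then runs tasks level by level, where tasks on the same level are independent.

```python
import math

class Runtime:
    """Toy task-superscalar runtime (hypothetical): infers dependencies from tile accesses."""
    def __init__(self):
        self.tasks = []        # (function, args, level)
        self.last_writer = {}  # tile key -> index of the last task that wrote it
        self.readers = {}      # tile key -> tasks that read it since the last write

    def submit(self, fn, args, reads, writes):
        deps = set()
        for key in reads:                              # read-after-write
            if key in self.last_writer:
                deps.add(self.last_writer[key])
        for key in writes:                             # write-after-write, write-after-read
            if key in self.last_writer:
                deps.add(self.last_writer[key])
            deps.update(self.readers.get(key, []))
        tid = len(self.tasks)
        level = 1 + max((self.tasks[d][2] for d in deps), default=0)
        self.tasks.append((fn, args, level))
        for key in reads:
            self.readers.setdefault(key, []).append(tid)
        for key in writes:
            self.last_writer[key] = tid
            self.readers[key] = []

    def run(self):
        # Tasks on one level have no mutual dependencies and could run
        # concurrently; this sketch simply runs them one after another.
        for fn, args, _ in sorted(self.tasks, key=lambda t: t[2]):
            fn(*args)
        return max((t[2] for t in self.tasks), default=0)

def potrf(a):
    """In-place Cholesky of a diagonal tile; only the lower triangle is used."""
    n = len(a)
    for j in range(n):
        a[j][j] = math.sqrt(a[j][j] - sum(a[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            a[i][j] = (a[i][j] - sum(a[i][k] * a[j][k] for k in range(j))) / a[j][j]

def trsm(l, b):
    """In-place solve of X * L^T = B for X, with L lower triangular."""
    for row in b:
        for j in range(len(l)):
            row[j] = (row[j] - sum(row[k] * l[j][k] for k in range(j))) / l[j][j]

def gemm(a, b, c):
    """In-place update C -= A * B^T (acts as a rank-k update when A is B)."""
    for i in range(len(c)):
        for j in range(len(c[0])):
            c[i][j] -= sum(a[i][k] * b[j][k] for k in range(len(a[0])))

def tiled_cholesky(tiles, rt):
    """Serial-looking right-looking tiled Cholesky; each kernel call becomes a task."""
    nt = len(tiles)
    for k in range(nt):
        rt.submit(potrf, (tiles[k][k],), reads=[], writes=[(k, k)])
        for i in range(k + 1, nt):
            rt.submit(trsm, (tiles[k][k], tiles[i][k]), reads=[(k, k)], writes=[(i, k)])
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                rt.submit(gemm, (tiles[i][k], tiles[j][k], tiles[i][j]),
                          reads=[(i, k), (j, k)], writes=[(i, j)])

def to_tiles(a, nb):
    """Copy an n x n matrix into an nt x nt grid of nb x nb tiles."""
    nt = len(a) // nb
    return [[[row[j * nb:(j + 1) * nb] for row in a[i * nb:(i + 1) * nb]]
             for j in range(nt)] for i in range(nt)]

def lower_from_tiles(tiles, nb):
    """Reassemble the lower-triangular factor from the tile grid."""
    n = len(tiles) * nb
    return [[tiles[i // nb][j // nb][i % nb][j % nb] if j <= i else 0.0
             for j in range(n)] for i in range(n)]

# Small demonstration on a 4 x 4 symmetric positive definite matrix, 2 x 2 tiles.
a = [[4.0, 2.0, 2.0, 1.0],
     [2.0, 5.0, 3.0, 2.0],
     [2.0, 3.0, 6.0, 3.0],
     [1.0, 2.0, 3.0, 7.0]]
tiles = to_tiles(a, 2)
rt = Runtime()
tiled_cholesky(tiles, rt)   # only records tasks and their dependencies
levels = rt.run()           # executes the recorded tasks in dependency order
l = lower_from_tiles(tiles, 2)
```

In this sketch `tiled_cholesky` reads like ordinary sequential code, and all scheduling information comes from the declared `reads` and `writes`. A real runtime of the kind the abstract describes would additionally dispatch ready tasks to CPU cores, GPUs, or remote nodes and move tiles between them, which this example does not attempt.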


Pages: 861-864

Page count: 4
