Communication Avoiding Algorithms: Analysis and Code Generation for Parallel Systems

被引：3

作者：

Murthy, Karthik ^{[1
]}

Mellor-Crummey, John ^{[1
]}

机构：

[1] Rice Univ, Dept Comp Sci, Houston, TX USA

来源：

2015 INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURE AND COMPILATION (PACT) | 2015年

关键词：

Compilers; polyhedral methods; reductions; modulo operations; optimization; parallel code generation;

D O I：

10.1109/PACT.2015.41

中图分类号：

TP3 [计算技术、计算机技术];

学科分类号：

0812 ;

摘要：

Data movement is a critical bottleneck for future generations of parallel systems. The class of .5D communication-avoiding algorithms were developed to address this bottleneck. These algorithms reduce communication and provide strong scaling in both time and energy. As a first step towards automating the development of communication-avoiding libraries, we developed the Maunam compiler. Maunam generates efficient parallel code from a high-level, global view sketch of .5D algorithms that are expressed using symbolic data sizes and numbers of processors. It supports the expression of data movement and communication through high-level global operations such as TILT and CSHIFT as well as through element-wise copy operations. With the latter, wraparound communication patterns can also be achieved using subscripts based on modulo operations. Maunam employs polyhedral analysis to reason about communication and computation present in a .5D algorithm. After partitioning data and computation, it inserts point-to-point and collective communication as needed. Maunam also analyzes data dependence patterns and data layouts to identify reductions over processor subsets. Maunam-generated Fortran+MPI code for 2.5D matrix multiplication running on 4096 cores of a Cray XC30 supercomputer achieves 59 TFlops/s (76% of the machine peak). Our generated parallel code achieves 91% of the performance of a hand-coded version.

引用

页码：150 / 162

页数：13

共 31 条

[1] Adve V., 2001, Compiler optimizations for scalable parallel systems. Languages, compilation techniques, and run time systems, P553
[2] Using integer sets for data-parallel program analysis and optimization
Adve, V
Mellor-Crummey, J
[J]. ACM SIGPLAN NOTICES, 1998, 33 (05) : 186 - 198
[3] A three-dimensional approach to parallel matrix multiplication
Agarwal, RC
Balle, SM
Gustavson, FG
Joshi, M
Palkar, P
[J]. IBM JOURNAL OF RESEARCH AND DEVELOPMENT, 1995, 39 (05) : 575 - 582
[4] [Anonymous], 1993, SIGPLAN Fortran Forum, V12, P1, DOI DOI 10.1145/174223.158909
[5] [Anonymous], 2008, EXASCALE COMPUTING S
[6] [Anonymous], P ANN IEEE ACM INT S
[7] Balaji P, 2009, LECT NOTES COMPUT SC, V5759, P20, DOI 10.1007/978-3-642-03770-2_9
[8] Ballard G, 2011, SPAA 11: PROCEEDINGS OF THE TWENTY-THIRD ANNUAL SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES, P1
[9] Code generation in the polyhedral model is easier than you think
Bastoul, C
[J]. 13TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURE AND COMPILATION TECHNIQUES, PROCEEDINGS, 2004, : 7 - 16
[10] Basu P, 2013, INT C HIGH PERFORM, P452, DOI 10.1109/HiPC.2013.6799131

← 1 2 3 4 →