Maximizing Communication–Computation Overlap Through Automatic Parallelization and Run-time Tuning of Non-blocking Collective Operations

Cited by: 0
Authors
Youcef Barigou
Edgar Gabriel
Affiliations
[1] University of Houston, Department of Computer Science
Source
International Journal of Parallel Programming | 2017, Vol. 45
Keywords
Non-blocking collective operations; Communication-computation overlap; Auto-tuning; MPI; OpenMP
Abstract
Non-blocking collective communication operations extend the concept of collective operations by adding the ability to overlap communication and computation. They are often considered key building blocks for scaling applications to very large process counts. Yet, using non-blocking collective operations in real-world applications is non-trivial: application codes often have to be restructured significantly in order to maximize the communication–computation overlap. This paper presents an approach to maximize the communication–computation overlap for hybrid OpenMP/MPI applications. The work leverages automatic parallelization by extending an existing tool to utilize non-blocking collective operations. It further integrates run-time auto-tuning of the non-blocking collective operations, optimizing both the algorithms used for these operations and the location and frequency of the accompanying progress function calls. Four application benchmarks were used to demonstrate the efficiency and versatility of the approach on two different platforms. The results indicate significant performance improvements in virtually all test scenarios. The resulting parallel applications achieved a performance improvement of up to 43% compared to the versions using blocking communication operations, and up to 95% of the maximum theoretical communication–computation overlap identified for each scenario.
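To illustrate the general pattern the abstract describes, the following C/MPI fragment is a minimal sketch (not the code produced by the paper's tool): it starts a non-blocking collective, performs independent computation in chunks, and issues periodic MPI_Test calls so the MPI library can progress the outstanding operation. The chunk size and the frequency of the progress calls are assumed placeholders for the kind of parameters the run-time tuning would adjust.

/* Sketch: overlapping an MPI_Iallreduce with independent computation.
 * Chunk size and test frequency are illustrative tuning knobs only. */
#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double sendbuf[N], recvbuf[N], work[N];
    for (int i = 0; i < N; i++) { sendbuf[i] = (double)i; work[i] = 0.0; }

    /* Start the non-blocking collective instead of a blocking MPI_Allreduce. */
    MPI_Request req;
    MPI_Iallreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* Independent computation, split into chunks; after each chunk a
     * progress call (MPI_Test) gives the library a chance to advance
     * the collective in the background. */
    int flag = 0;
    const int chunk = 128;
    for (int start = 0; start < N; start += chunk) {
        for (int i = start; i < start + chunk; i++)
            work[i] = work[i] * 0.5 + (double)i;
        if (!flag)
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
    }

    /* Make sure the collective has completed before recvbuf is used. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}

In this sketch, placing the MPI_Test calls between computation chunks is what creates the opportunity for overlap; how often to test and how large the chunks should be are exactly the kind of platform-dependent choices the paper delegates to run-time tuning.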
Pages: 1390–1416
Number of pages: 26