GPUIterator: Bridging the Gap between Chapel and GPU Platforms

Cited by: 7
Authors
Hayashi, Akihiro [1 ]
Paul, Sri Raj [2 ]
Sarkar, Vivek [2 ]
Affiliations
[1] Rice Univ, Dept Comp Sci, Houston, TX 77251 USA
[2] Georgia Inst Technol, Coll Comp, Atlanta, GA 30332 USA
Source
CHIUW'19: PROCEEDINGS OF THE ACM SIGPLAN 6TH CHAPEL IMPLEMENTERS AND USERS WORKSHOP | 2019
Keywords
Chapel; GPU; Parallel Iterators;
DOI
10.1145/3329722.3330142
CLC classification
TP31 [Computer software];
Subject classification codes
081202; 0835;
Abstract
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches on mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both a GPU and CPU cores, which is beneficial on specific platforms. Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize one or more CPU+GPU nodes while maintaining portability.
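The abstract describes GPUIterator as a wrapper around a Chapel forall loop that splits iterations between CPU cores and a hand-tuned native GPU kernel. A minimal Chapel sketch of that usage pattern is shown below; the identifiers `GPU`, `GPUCallBack`, and `CPUratio`, and the callback signature, are assumptions based on the abstract's description of the interface, not a definitive rendering of the module's API.

```chapel
use GPUIterator;

config const n = 1024;
// Assumed knob: percentage of the iteration space kept on CPU cores
// for hybrid CPU+GPU execution; the remainder goes to the GPU.
config const CPUratio = 50;

var A, B: [1..n] real;

// Callback invoked for the GPU portion of the range [lo..hi].
// In practice this would call a hand-tuned CUDA kernel (and its
// host-side data transfers) through Chapel's C interoperability;
// the extern kernel launcher here is hypothetical.
proc GPUCallBack(lo: int, hi: int, N: int) {
  // extern proc launchScaleKernel(lo: int, hi: int, N: int);
  // launchScaleKernel(lo, hi, N);
}

// GPU() partitions the iterations: the CPU share executes the
// original loop body on CPU cores, the rest is handed to GPUCallBack.
forall i in GPU(1..n, GPUCallBack, CPUratio) {
  A(i) = B(i) * 2.0;
}
```

Because the loop body is unchanged and the GPU path is confined to the callback, setting the CPU share to 100% recovers an ordinary all-CPU forall, which is what makes the interface portable across nodes with and without GPUs.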
Pages: 2-11
Page count: 10