GPUIterator: Bridging the Gap between Chapel and GPU Platforms

Cited by: 7
Authors
Hayashi, Akihiro [1]
Paul, Sri Raj [2]
Sarkar, Vivek [2]
Affiliations
[1] Rice Univ, Dept Comp Sci, Houston, TX 77251 USA
[2] Georgia Inst Technol, Coll Comp, Atlanta, GA 30332 USA
Source
CHIUW'19: PROCEEDINGS OF THE ACM SIGPLAN 6TH CHAPEL IMPLEMENTERS AND USERS WORKSHOP | 2019
Keywords
Chapel; GPU; Parallel Iterators
DOI
10.1145/3329722.3330142
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches to mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both a GPU and CPU cores, which is beneficial for specific platforms. Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize a single or multiple CPU+GPU node(s) while maintaining portability.
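The hybrid execution described in the abstract can be pictured as splitting one loop's iteration range into a CPU portion and a GPU portion. The following is a minimal, language-neutral sketch in Python, not the GPUIterator module's actual Chapel interface. The names `split_range`, `hybrid_forall`, `cpu_percent`, `cpu_body`, and `gpu_callback` are hypothetical, and the choice to give the CPU the leading part of the range is an assumption made for illustration only.

```python
def split_range(lo, hi, cpu_percent):
    """Split the inclusive range lo..hi into a CPU part and a GPU part.

    Assumption for illustration: the CPU takes the first cpu_percent
    of the iterations and the GPU takes the remainder.
    """
    if not 0 <= cpu_percent <= 100:
        raise ValueError("cpu_percent must be between 0 and 100")
    n = max(hi - lo + 1, 0)          # total number of iterations
    n_cpu = n * cpu_percent // 100   # iterations assigned to the CPU
    cpu = range(lo, lo + n_cpu)
    gpu = range(lo + n_cpu, hi + 1)
    return cpu, gpu


def hybrid_forall(lo, hi, cpu_percent, cpu_body, gpu_callback):
    """Run cpu_body(i) on the CPU part and hand the rest to gpu_callback.

    gpu_callback(lo, hi, n) stands in for a hand-tuned native GPU routine
    that processes the inclusive sub-range lo..hi of n iterations.
    This sketch runs both parts sequentially; a real runtime would run
    them concurrently.
    """
    cpu, gpu = split_range(lo, hi, cpu_percent)
    for i in cpu:
        cpu_body(i)
    if len(gpu) > 0:
        gpu_callback(gpu.start, gpu.stop - 1, len(gpu))


# Usage: copy b into a, with 30% of the iterations on the "CPU" path.
a = [0] * 10
b = list(range(10))

def copy_on_cpu(i):
    a[i] = b[i]

def copy_on_gpu(lo, hi, n):
    # Placeholder for a native GPU kernel launch over lo..hi.
    a[lo:hi + 1] = b[lo:hi + 1]

hybrid_forall(0, 9, 30, copy_on_cpu, copy_on_gpu)
```

Setting `cpu_percent` to 0 in this sketch sends every iteration to the GPU callback, and 100 keeps the whole loop on the CPU path, which mirrors the abstract's point that one loop can target a GPU, CPU cores, or both.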
Pages: 2-11
Number of pages: 10