GPU-Chariot: A Programming Framework for Stream Applications Running on Multi-GPU Systems

Cited by: 4

Authors:

Ino, Fumihiko [1]

Nakagawa, Shinta [2]

Hagihara, Kenichi [1]

Affiliations:

[1] Osaka Univ, Grad Sch Informat Sci & Technol, Suita, Osaka 5650871, Japan

[2] NEC Corp Ltd, Storage Div, Fuchu, Tokyo 1838501, Japan

Source:

IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS | 2013 / Vol. E96D / No. 12

Keywords:

stream processing; GPGPU; CUDA; task scheduling; GRAPHICS

DOI:

10.1587/transinf.E96.D.2604

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Subject classification code:

0812

Abstract:

This paper presents a stream programming framework, named GPU-chariot, for accelerating stream applications running on graphics processing units (GPUs). The main contribution of our framework is that it realizes efficient software pipelines on multi-GPU systems by enabling out-of-order execution of CPU functions, kernels, and data transfers. To achieve this out-of-order execution, we apply a runtime scheduler that not only maximizes the utilization of system resources but also encapsulates the number of GPUs available in the system. In addition, we implement a load-balancing capability to flow data efficiently through multiple GPUs. Furthermore, a callback interface enables overlapping execution of functions in third-party libraries. By using kernels with different performance bottlenecks, we show that our out-of-order execution is up to 20% faster than in-order execution. Finally, we conduct several case studies on a 4-GPU system and demonstrate the advantages of GPU-chariot over a manually pipelined code. We conclude that GPU-chariot can be useful when developing stream applications with software pipelines on multiple GPUs and CPUs.


Pages: 2604-2616

Page count: 13
