Efficient GPU Spatial-Temporal Multitasking

Cited by: 66
Authors
Liang, Yun [1]
Huynh Phung Huynh [2]
Rupnow, Kyle [3]
Goh, Rick Siow Mong [2]
Chen, Deming [4]
Affiliations
[1] Peking Univ, Sch EECS, Ctr Energy Efficient Comp & Applicat, Beijing 100871, Peoples R China
[2] A STAR Inst High Performance Comp, Dept Comp Sci, Singapore 138632, Singapore
[3] Adv Digital Sci Ctr, Singapore 138632, Singapore
[4] Univ Illinois, Urbana, IL 61801 USA
Funding
National Natural Science Foundation of China;
Keywords
GPU; spatial; temporal; multitasking; resource allocation; COMPILER; MODEL;
DOI
10.1109/TPDS.2014.2313342
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
Heterogeneous computing nodes are now pervasive throughout computing, and GPUs have emerged as a leading device for application acceleration. GPUs offer tremendous computing potential for data-parallel applications, and their emergence has led to a proliferation of GPU-accelerated applications. This proliferation has in turn produced systems in which many applications compete for access to GPU resources, so efficient utilization of those resources is critical to system performance. Prior temporal multitasking techniques can be applied to GPU resources as well, but not all GPU kernels make full use of the GPU, leaving an unmet need for spatial multitasking: resources used inefficiently by one kernel can instead be assigned to another kernel that can use them more effectively. In this paper, we propose a software-hardware solution for efficient spatial-temporal multitasking together with a software-based emulation framework for our system. We pair an efficient software heuristic with hardware leaky-bucket-based thread-block interleaving to implement spatial-temporal multitasking. We demonstrate our techniques on several GPU architectures using nine representative benchmarks from the CUDA SDK. Our experiments on a Fermi GTX480 show performance improvements of up to 46% (average 26%) over sequential GPU task execution and 37% (average 18%) over default concurrent multitasking. Compared with the state-of-the-art Kepler K20 using Hyper-Q technology, our technique achieves up to 40% (average 17%) performance improvement over default concurrent multitasking.
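The abstract describes the mechanism only at a high level: a software heuristic decides how to split streaming multiprocessors (SMs) between co-running kernels, and a leaky-bucket dispatcher interleaves their thread blocks according to that split. The short Python simulator below is a minimal sketch of one plausible reading of such interleaving, not the paper's hardware implementation; the kernel names, block counts, and SM shares are hypothetical.

# Illustrative sketch only: each co-scheduled kernel accrues credit at a
# rate proportional to its assigned SM share, and the dispatcher issues the
# next thread block from the kernel currently holding the most credit.
# All concrete values here (names, block counts, shares) are assumptions.

from dataclasses import dataclass

@dataclass
class KernelState:
    name: str
    blocks_left: int     # thread blocks not yet dispatched
    share: float         # fraction of SMs assigned by the software heuristic
    tokens: float = 0.0  # leaky-bucket credit

def interleave(kernels, capacity=1.0):
    """Return the dispatch order of thread blocks for co-scheduled kernels."""
    order = []
    while any(k.blocks_left > 0 for k in kernels):
        # Refill each active kernel's bucket in proportion to its SM share,
        # capped at the bucket capacity.
        for k in kernels:
            if k.blocks_left > 0:
                k.tokens = min(capacity, k.tokens + k.share)
        # Dispatch one block from the active kernel with the largest credit.
        nxt = max((k for k in kernels if k.blocks_left > 0),
                  key=lambda k: k.tokens)
        nxt.tokens -= 1.0
        nxt.blocks_left -= 1
        order.append(nxt.name)
    return order

if __name__ == "__main__":
    schedule = interleave([KernelState("A", blocks_left=9, share=0.75),
                           KernelState("B", blocks_left=3, share=0.25)])
    print(" ".join(schedule))  # roughly three A blocks per B block

With the hypothetical 0.75/0.25 split, the printed schedule issues roughly three thread blocks of kernel A for each block of kernel B, which is the proportional interleaving behavior the abstract attributes to the leaky-bucket scheme.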
Pages: 748-760
Page count: 13