Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls

Cited by: 25
Authors
Dai, Hongwen [1 ]
Lin, Zhen [1 ]
Li, Chao [1 ]
Zhao, Chen [2 ]
Wang, Fei [2 ]
Zheng, Nanning [2 ]
Zhou, Huiyang [1 ]
Affiliations
[1] North Carolina State Univ, Dept Elect & Comp Engn, Raleigh, NC 27695 USA
[2] Xi An Jiao Tong Univ, Sch Elect & Informat Engn, Xian, Shaanxi, Peoples R China
Source
2018 24TH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA) | 2018
Keywords
HIGH-PERFORMANCE; CACHE;
DOI
10.1109/HPCA.2018.00027
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline code
0812;
Abstract
Following the advances in technology scaling, graphics processing units (GPUs) incorporate an increasing amount of computing resources, and it becomes difficult for a single GPU kernel to fully utilize these vast resources. One solution to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets the leftover resources; however, it fails to optimize resource utilization and does not provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although it achieves better fairness, it does not address resource underutilization within an SM. Thus, intra-SM sharing has been proposed to issue thread blocks from different kernels to each SM. However, as shown in this study, overall performance may be undermined in intra-SM sharing schemes due to severe interference among kernels. Specifically, because concurrent kernels share the memory subsystem, one kernel, even a compute-intensive one, may be starved because it cannot issue memory instructions in time. In addition, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, impact other kernels and further hurt overall performance. In this study, we investigate various approaches to overcome the aforementioned problems exposed in intra-SM sharing. We first highlight that cache partitioning techniques proposed for CPUs are not effective for GPUs. Then we propose two approaches to reduce memory pipeline stalls. The first is to balance the memory accesses of concurrent kernels. The second is to limit the number of in-flight memory instructions issued from individual kernels. Our evaluation shows that the proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.
Pages: 208-220
Number of pages: 13
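The second mechanism summarized in the abstract, capping the number of in-flight memory instructions that each co-running kernel may issue, can be pictured with a small simulator-style sketch. This is a minimal illustration under assumed parameters, not the paper's actual hardware design; the class and constant names (MemIssueThrottle, kQuota, kMaxKernelsPerSM) are hypothetical.

```cpp
// Hypothetical sketch of per-kernel throttling of in-flight memory instructions,
// as one might model it inside a GPU simulator's issue stage.
#include <array>
#include <cstdint>

constexpr int kMaxKernelsPerSM = 2;   // assumed: two kernels co-scheduled per SM
constexpr uint32_t kQuota = 16;       // assumed per-kernel cap on in-flight memory instructions

class MemIssueThrottle {
public:
    // Issue-stage check: may this kernel issue another memory instruction?
    bool can_issue(int kernel_id) const {
        return inflight_[kernel_id] < kQuota;
    }
    // Bookkeeping when a memory instruction is issued or completes.
    void on_issue(int kernel_id)    { ++inflight_[kernel_id]; }
    void on_complete(int kernel_id) { --inflight_[kernel_id]; }

private:
    std::array<uint32_t, kMaxKernelsPerSM> inflight_{};  // in-flight count per kernel
};
```

In such a model, the warp scheduler would consult can_issue() before selecting a memory instruction from a given kernel; a memory-intensive kernel that has reached its quota is skipped, leaving load/store units and MSHRs available to the co-running kernel and reducing the memory pipeline stalls described above.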