FlexSched: Efficient scheduling techniques for concurrent kernel execution on GPUs

Cited by: 2
Authors
Lopez-Albelda, Bernabe [1 ]
Castro, Francisco M. [1 ]
Gonzalez-Linares, Jose M. [1 ]
Guil, Nicolas [1 ]
Affiliations
[1] Univ Malaga, Dept Comp Architecture, Campus Teatinos, Malaga 29071, Spain
Keywords
GPU scheduling; Concurrent kernel execution; Online profiling; Simultaneous multikernel; PREDICTION;
DOI
10.1007/s11227-021-03819-z
CLC classification code
TP3 (computing technology, computer technology)
Subject classification code
0812
Abstract
Nowadays, GPU clusters are available in almost every data processing center. Their GPUs are typically shared by applications that may have different processing needs and/or different priority levels. In this scenario, concurrent kernel execution can improve device utilization by co-executing kernels with different or complementary resource-utilization profiles. A paramount issue in concurrent kernel execution on GPUs is obtaining a suitable distribution of streaming multiprocessor (SM) resources among the co-executing kernels to fulfill different scheduling aims. In this work, we present a software scheduler, named FlexSched, that employs a low-overhead runtime mechanism to perform intra-SM allocation of cooperative thread arrays (a.k.a. thread blocks) of co-executing kernels. It also implements a productive online profiling mechanism that dynamically changes the kernels' resource assignments according to the instantaneous performance achieved by the co-running kernels. An important characteristic of our approach is that no off-line kernel analysis is required to establish the best resource assignment for co-located kernels; thus, it can run on any system where new applications must be scheduled immediately. Using a set of nine applications (13 kernels), we show that our approach improves on the co-execution performance of recent slicing methods: it obtains a co-execution speedup of 1.40x, while the slicing method achieves only 1.29x. In addition, we test FlexSched in a real scheduling scenario where new applications are launched as soon as GPU resources become available. In this scenario, FlexSched reduces the average overall execution time by a factor of 1.25x with respect to the time obtained when the proprietary hardware scheduler (Hyper-Q) is employed. Finally, FlexSched is also used to implement scheduling policies that guarantee a maximum turnaround time for latency-sensitive applications while achieving high resource utilization through kernel co-execution.
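The core idea described in the abstract — profiling co-running kernels online and shifting per-SM thread-block shares toward the split with the best combined performance — can be illustrated with a minimal model. This is a hypothetical sketch, not the authors' implementation: the function names, the exhaustive split search, and the synthetic throughput curves are all assumptions made for illustration; a real scheduler like FlexSched samples hardware counters at runtime and evaluates only a few candidate splits.

```python
# Illustrative model (not FlexSched itself) of online-profiling-driven
# intra-SM block allocation: two co-resident kernels share a fixed budget
# of thread blocks per SM, and the scheduler picks the split that
# maximizes their combined measured throughput.

def combined_throughput(blocks_a, blocks_b, perf_a, perf_b):
    """Combined progress of both kernels for a given intra-SM block split.

    perf_a / perf_b map a block count to a throughput estimate -- a
    stand-in for whatever counters an online profiler would read.
    """
    return perf_a(blocks_a) + perf_b(blocks_b)

def best_split(total_blocks, perf_a, perf_b):
    """Profile every split of the per-SM block budget and return the one
    with the highest combined throughput. Each kernel gets at least one
    block so both keep making progress."""
    candidates = [(a, total_blocks - a) for a in range(1, total_blocks)]
    return max(candidates,
               key=lambda s: combined_throughput(s[0], s[1], perf_a, perf_b))

# Synthetic example: kernel A is memory-bound and stops scaling past
# 3 blocks, kernel B is compute-bound and scales almost linearly, so
# the profiler should hand most of the budget to B.
mem_bound = lambda b: min(b, 3) * 1.0   # saturates at 3 blocks
compute_bound = lambda b: 0.9 * b       # keeps scaling with blocks

print(best_split(8, mem_bound, compute_bound))  # → (3, 5)
```

The exhaustive search over splits stands in for the paper's dynamic reassignment loop: because profiling happens online, no off-line analysis of each kernel pair is needed, which is the property the abstract highlights.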
Pages: 43-71
Number of pages: 29