Demystifying the Placement Policies of the NVIDIA GPU Thread Block Scheduler for Concurrent Kernels

Cited by: 10
Authors
Gilman G. [1 ]
Ogden S.S. [1 ]
Guo T. [1 ]
Walls R.J. [1 ]
Affiliations
[1] Worcester Polytechnic Institute, Worcester
Source
Performance Evaluation Review | 2021, Vol. 48, No. 3
Keywords
concurrent kernels; GPGPUs; scheduling algorithms
DOI
10.1145/3453953.3453972
Abstract
In this work, we empirically derive the scheduler's behavior under concurrent workloads for NVIDIA's Pascal, Volta, and Turing microarchitectures. In contrast to past studies that suggest the scheduler uses a round-robin policy to assign thread blocks to streaming multiprocessors (SMs), we instead find that the scheduler chooses the next SM based on the SM's local resource availability. We show how this scheduling policy can lead to significant, and seemingly counter-intuitive, performance degradation; for example, a decrease of one thread per block resulted in a 3.58X increase in execution time for one kernel in our experiments. We hope that our work will be useful for improving the accuracy of GPU simulators and aid in the development of novel scheduling algorithms. © 2021 Copyright is held by the owner/author(s).
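The abstract contrasts two placement policies: the round-robin assignment assumed by prior studies, and the policy the authors observe, in which the scheduler places each thread block on the SM with the most available local resources. The toy simulation below illustrates that contrast; it is a minimal sketch, not NVIDIA's actual scheduler, and the capacity/cost parameters are illustrative assumptions.

```python
# Toy model of two thread-block placement policies. This is an illustrative
# sketch only -- the real hardware scheduler tracks multiple per-SM resources
# (registers, shared memory, thread slots), which we collapse into one number.

def assign_round_robin(num_sms, num_blocks):
    """Round-robin policy assumed by past studies: block i goes to SM i mod num_sms."""
    return [i % num_sms for i in range(num_blocks)]

def assign_most_available(num_sms, num_blocks, block_cost=1, sm_capacity=4):
    """Resource-availability policy reported by the paper (greedy approximation):
    each block is placed on the SM with the most remaining free capacity."""
    free = [sm_capacity] * num_sms
    placement = []
    for _ in range(num_blocks):
        sm = max(range(num_sms), key=lambda s: free[s])  # most free resources wins
        free[sm] -= block_cost
        placement.append(sm)
    return placement

if __name__ == "__main__":
    # With uniform blocks the two policies coincide; they diverge once blocks
    # have unequal resource costs or SMs start with unequal occupancy.
    print(assign_round_robin(4, 6))
    print(assign_most_available(4, 6))
```

Under uniform per-block costs the greedy policy degenerates to a round-robin-like spread, which may explain why earlier microbenchmarks inferred round-robin behavior; the policies diverge when kernels with different per-block footprints run concurrently.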
Pages: 81-88
Page count: 7