Modelling Performance Loss due to Thread Imbalance in Stochastic Variable-Length SIMT Workloads

Cited by: 0

Authors:

Swatman, Stephen Nicholas [1,2]

Varbanescu, Ana-Lucia [3]

Krasznahorkay, Attila [2]

Pimentel, Andy [1]

Affiliations:

[1] University of Amsterdam, Amsterdam, Netherlands

[2] European Organization for Nuclear Research (CERN), Geneva, Switzerland

[3] University of Twente, Enschede, Netherlands

Source:

2022 30TH INTERNATIONAL SYMPOSIUM ON MODELING, ANALYSIS, AND SIMULATION OF COMPUTER AND TELECOMMUNICATION SYSTEMS (MASCOTS) | 2022

Keywords:

SIMT; imbalance; performance modelling; GPU;

DOI:

10.1109/MASCOTS56607.2022.00026

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812;

Abstract:

When designing algorithms for single-instruction multiple-thread (SIMT) devices such as general-purpose graphics processing units (GPGPUs), thread imbalance is an important performance consideration. Thread imbalance can emerge in iterative applications where workloads are of variable length, because threads processing larger amounts of work will cause threads with less work to idle. This form of thread imbalance influences the design space of algorithms, particularly in terms of processing granularity, but we lack models to quantify its impact on application performance. In this paper, we present a statistical model for quantifying the performance loss due to thread imbalance for iterative SIMT applications with stochastic, variable-length workloads. Our model is designed to operate with minimal knowledge of the implementation details of the algorithm, relying solely on an understanding of the probability distribution of the lengths of the workloads. We validate our model against a synthetic benchmark based on a Monte Carlo simulation of matrix exponentiation, and show that our model achieves nearly perfect accuracy. Compared to empirical data extracted from real hardware, our model maintains a high degree of accuracy, predicting mean performance loss within a margin of 2%.
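The idling effect described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the statistical model from the paper; it is a toy estimate under stated assumptions: a group of `warp_size` threads runs in lockstep until its longest workload finishes, workload lengths are independent draws from one distribution, and every iteration costs the same. The names `simulate_idle_fraction` and `geometric_length` are hypothetical and chosen only for this example.

```python
import random


def simulate_idle_fraction(sample_length, warp_size=32, trials=10_000, seed=0):
    """Estimate the mean fraction of thread-iterations spent idle.

    Assumption: all threads in a group occupy the hardware for as many
    iterations as the longest workload in that group needs.
    """
    rng = random.Random(seed)
    total_loss = 0.0
    for _ in range(trials):
        lengths = [sample_length(rng) for _ in range(warp_size)]
        longest = max(lengths)
        if longest == 0:
            continue  # nothing to run, so this trial contributes zero loss
        useful = sum(lengths)            # iterations doing real work
        occupied = warp_size * longest   # iterations the group is busy
        total_loss += 1.0 - useful / occupied
    return total_loss / trials


def geometric_length(p):
    """Return a sampler for geometric workload lengths (support 1, 2, ...)."""
    def sample(rng):
        n = 1
        while rng.random() >= p:
            n += 1
        return n
    return sample


if __name__ == "__main__":
    for width in (1, 2, 8, 32):
        loss = simulate_idle_fraction(geometric_length(0.2), warp_size=width)
        print(f"group size {width:2d}: estimated idle fraction {loss:.3f}")
```

With a constant workload length the estimate is exactly zero, since no thread waits for another. With a stochastic length the estimate grows with the group size, because the longest of many draws moves further from the mean.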


Pages: 137-144

Page count: 8

