Dynamic warp formation and scheduling for efficient GPU control flow

Cited by: 210
Authors
Fung, Wilson W. L. [1]
Sham, Ivan [1]
Yuan, George [1]
Aamodt, Tor M. [1]
Affiliations
[1] Univ British Columbia, Dept Elect & Comp Engn, Vancouver, BC V5Z 1M9, Canada
Source
MICRO-40: PROCEEDINGS OF THE 40TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE | 2007
Funding
Natural Sciences and Engineering Research Council of Canada
DOI
10.1109/MICRO.2007.30
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead incurred by control hardware. Scalar threads are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture, and programs using these instructions may experience reduced performance due to the way branch execution is supported by hardware. One approach is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance. In this paper, we explore mechanisms for more efficient SIMD branch execution on GPUs. We show that a realistic hardware implementation that dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes improves performance by an average of 20.7% for an estimated area increase of 4.7%.
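As a rough illustration of why regrouping can help, the following Python sketch counts how many warp issue slots one divergent branch needs under two simplified models: per-warp serialization of both paths (the stack-style approach the abstract mentions) and regrouping of threads across warps by branch outcome. This is a toy model, not the paper's hardware design. The function names, the warp size of 4, and the branch outcomes are assumptions made for illustration, and the model ignores any hardware constraints on which threads may be merged.

```python
from math import ceil

# Hypothetical helper names; a toy counting model, not the paper's mechanism.

def issue_slots_serialized(outcomes, warp_size):
    """Each fixed warp issues once per distinct branch outcome among its threads."""
    slots = 0
    for start in range(0, len(outcomes), warp_size):
        warp = outcomes[start:start + warp_size]
        slots += len(set(warp))  # 1 if all threads agree, 2 if the warp diverged
    return slots

def issue_slots_regrouped(outcomes, warp_size):
    """Threads with the same outcome are packed into new warps across old warp boundaries."""
    taken = sum(1 for o in outcomes if o)
    not_taken = len(outcomes) - taken
    return ceil(taken / warp_size) + ceil(not_taken / warp_size)

# Assumed example: 8 threads, warp size 4, alternating branch outcomes.
outcomes = [True, False, True, False, False, True, False, True]
serialized = issue_slots_serialized(outcomes, 4)
regrouped = issue_slots_regrouped(outcomes, 4)
print(serialized, regrouped)
```

In this assumed example both original warps diverge, so the serialized model needs four issue slots, while packing the four taken and four not-taken threads into two full warps needs two. When no warp diverges, the two counts coincide.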
Pages: 407 / +
Page count: 2