Efficient utilization of SIMD extensions

Cited by: 32
Authors
Franchetti, F [1]
Kral, S [1]
Lorenz, J [1]
Ueberhuber, CW [1]
Affiliations
[1] Vienna Univ Technol, A-1040 Vienna, Austria
Keywords
automatic vectorization; digital signal processing (DSP); fast Fourier transform (FFT); short vector single instruction, multiple data (SIMD); symbolic vectorization
DOI
10.1109/JPROC.2004.840491
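Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809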
Abstract
This paper targets automatic performance tuning of numerical kernels in the presence of multilayered memory hierarchies and single-instruction, multiple-data (SIMD) parallelism. The studied SIMD instruction set extensions include Intel's SSE family, AMD's 3DNow!, Motorola's AltiVec, and IBM's BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize ANSI C code and feed it into the target machine's general-purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. The particular code structure hampers automatic vectorization and, thus, inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special-purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes: 1) symbolic vectorization of digital signal processing transforms; 2) straight-line code vectorization for numerical kernels; and 3) compiler back ends for straight-line code with vector instructions. Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speedups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems as well as roughly matching the performance of hand-tuned vendor libraries.
Pages: 409-425
Number of pages: 17