AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on x86 CPUs

Cited by: 87
Authors
Wang, Qian [1]
Zhang, Xianyi [1,2]
Zhang, Yunquan [3]
Yi, Qing [4]
Affiliations
[1] Univ Chinese Acad Sci, Chinese Acad Sci, Inst Software, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Software, State Key Lab Comp Architecture, Beijing, Peoples R China
[4] Univ Colorado, Boulder, CO 80309 USA
Source
2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC) | 2013
Keywords
DLA code optimization; code generation; auto-tuning
DOI
10.1145/2503210.2503219
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present AUGEM, a template-based optimization framework that automatically generates fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY, and DOT, on a variety of multi-core CPUs without requiring any manual intervention from developers. In particular, drawing on domain-specific knowledge about the algorithms of the DLA kernels, we use a collection of parameterized code templates to describe a number of commonly occurring instruction sequences within the optimized low-level C code of these kernels. Our framework then uses a specialized low-level C optimizer to identify instruction sequences that match the predefined code templates and translates them into extremely efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach surpass the implementations in the Intel MKL and AMD ACML BLAS libraries on both Intel Sandy Bridge and AMD Piledriver processors.
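To make the idea of matching instruction sequences in low-level C more concrete, the following is a minimal hand-written sketch for the AXPY kernel (y[i] += alpha * x[i]). It is not AUGEM output: the abstract says the framework emits assembly, whereas this sketch uses SSE2 intrinsics as a stand-in, and the function names `axpy_ref`, `axpy_lowlevel`, and `axpy_simd` are hypothetical. The unrolled body of `axpy_lowlevel` shows the kind of repeated load, multiply, add, store sequence that a parameterized template could plausibly describe; `axpy_simd` shows one possible packed equivalent.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics; only used when the target supports them */
#endif

/* Reference AXPY: y[i] += alpha * x[i]. */
void axpy_ref(size_t n, double alpha, const double *x, double *y) {
    assert(n == 0 || (x && y));
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Low-level C form (illustrative): unrolled by 2, one operation per statement,
 * so the load / multiply / add / store sequence is explicit. */
void axpy_lowlevel(size_t n, double alpha, const double *x, double *y) {
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        double x0 = x[i];
        double x1 = x[i + 1];
        double y0 = y[i];
        double y1 = y[i + 1];
        double t0 = alpha * x0;
        double t1 = alpha * x1;
        y0 = y0 + t0;
        y1 = y1 + t1;
        y[i] = y0;
        y[i + 1] = y1;
    }
    for (; i < n; ++i)  /* remainder when n is odd */
        y[i] += alpha * x[i];
}

/* Packed form (assumption: SSE2 intrinsics stand in for generated assembly).
 * Falls back to the low-level C form on targets without SSE2. */
void axpy_simd(size_t n, double alpha, const double *x, double *y) {
#if defined(__SSE2__)
    size_t i = 0;
    __m128d va = _mm_set1_pd(alpha);           /* broadcast alpha to both lanes */
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);      /* unaligned load of x[i], x[i+1] */
        __m128d vy = _mm_loadu_pd(y + i);      /* unaligned load of y[i], y[i+1] */
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
        _mm_storeu_pd(y + i, vy);
    }
    for (; i < n; ++i)  /* remainder when n is odd */
        y[i] += alpha * x[i];
#else
    axpy_lowlevel(n, alpha, x, y);
#endif
}
```

All three functions take the same arguments and should produce the same result for the same inputs, so `axpy_ref` can serve as a correctness check for the other two. The unroll factor of 2 and the use of unaligned loads are choices made for this sketch only.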
Pages: 12