AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on x86 CPUs

Cited by: 87

Authors:

Wang, Qian [1]

Zhang, Xianyi [1,2]

Zhang, Yunquan [3]

Yi, Qing [4]

Affiliations:

[1] Univ Chinese Acad Sci, Chinese Acad Sci, Inst Software, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

[3] Chinese Acad Sci, Inst Software, State Key Lab Comp Architecture, Beijing, Peoples R China

[4] Univ Colorado, Boulder, CO 80309 USA

Source:

2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC) | 2013

Keywords:

DLA code optimization; code generation; auto-tuning;

DOI:

10.1145/2503210.2503219

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present a template-based optimization framework, AUGEM, which can automatically generate fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY and DOT, on varying multi-core CPUs without requiring any manual intervention from developers. In particular, based on domain-specific knowledge about the algorithms of the DLA kernels, we use a collection of parameterized code templates to formulate a number of commonly occurring instruction sequences within the optimized low-level C code of these DLA kernels. Then, our framework uses a specialized low-level C optimizer to identify instruction sequences that match the pre-defined code templates and thereby translates them into extremely efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach surpass the implementations of the Intel MKL and AMD ACML BLAS libraries on both Intel Sandy Bridge and AMD Piledriver processors.
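As a rough illustration of the kind of "commonly occurring instruction sequence" the abstract refers to, the sketch below writes an AXPY kernel (y = alpha * x + y) in unrolled, three-address C. It is not code from the paper: the function name `axpy_unrolled`, the unroll factor of 4, and the temporaries are assumptions chosen for this example. The point is only that each group of four loads, multiplies, adds and stores on contiguous doubles has a regular shape that a template matcher could, in principle, recognize and map onto packed SIMD instructions.

```c
#include <stddef.h>

/* Hypothetical illustration (not taken from the paper):
 * AXPY, y[i] = y[i] + alpha * x[i], unrolled by 4 in three-address form. */
void axpy_unrolled(size_t n, double alpha, const double *x, double *y)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Load four contiguous elements of x and y. Four doubles occupy
         * 256 bits, which is the width of one AVX register. */
        double x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        double y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];

        /* Multiply each x element by the same scalar. */
        double t0 = alpha * x0;
        double t1 = alpha * x1;
        double t2 = alpha * x2;
        double t3 = alpha * x3;

        /* Accumulate into the y values. */
        y0 = y0 + t0;
        y1 = y1 + t1;
        y2 = y2 + t2;
        y3 = y3 + t3;

        /* Store the four results back to contiguous memory. */
        y[i]     = y0;
        y[i + 1] = y1;
        y[i + 2] = y2;
        y[i + 3] = y3;
    }

    /* Scalar remainder loop for the last n % 4 elements. */
    for (; i < n; ++i) {
        y[i] = y[i] + alpha * x[i];
    }
}
```

Calling `axpy_unrolled(6, 2.0, x, y)` processes the first four elements in the unrolled body and the remaining two in the scalar loop. How AUGEM actually defines and matches its templates is described in the paper itself, not in this record.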


Pages: 12
