Auto-tuning GEMM kernels on the Intel KNL and Intel Skylake-SP processors

Cited by: 8
Authors
Lim, Roktaek [1]
Lee, Yeongha [1]
Kim, Raehyun [1]
Choi, Jaeyoung [1]
Lee, Myungho [2]
Affiliations
[1] Soongsil Univ, Seoul 06978, South Korea
[2] Myongji Univ, Yongin 17058, Gyeonggi, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Manycore; Intel Xeon Phi; Intel Skylake-SP; Auto-tuning; Matrix-matrix multiplication; AVX-512; IMPLEMENTATION;
DOI
10.1007/s11227-018-2702-1
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
General matrix-matrix multiplication (GEMM) is a core building block for implementing the Basic Linear Algebra Subprograms (BLAS). This paper presents a methodology for automatically producing matrix-matrix multiplication kernels tuned for the Intel Xeon Phi processor code-named Knights Landing (KNL) and the Intel Skylake-SP processors, using AVX-512 intrinsic functions. The architecture of the latest manycore processors has become complicated in its levels of parallelism and its cache hierarchies, so it is not easy to find the best combination of optimization techniques for a given application. Our approach produces matrix multiplication kernels through heuristic auto-tuning: it generates multiple kernels and selects the fastest ones through performance tests. The tuning parameters include the sizes of the block matrices for registers and caches, the prefetch distances, and the loop unrolling depth. Parameters for multithreaded execution, such as which loops to parallelize and the optimal number of threads for those loops, are also investigated. We also present a method to reduce the parameter search space based on our previous research results.
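The generate-and-test selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' kernel generator: a pure-Python blocked multiplication stands in for the AVX-512 kernels, only two tuning parameters are searched, and the names `blocked_gemm`, `autotune`, `mb`, and `kb` are hypothetical, chosen for illustration only.

```python
import itertools
import random
import time


def blocked_gemm(a, b, n, mb, kb):
    """Return C = A * B for n x n row-major lists, tiled by mb (rows) and kb (depth)."""
    c = [0.0] * (n * n)
    for i0 in range(0, n, mb):
        for k0 in range(0, n, kb):
            for i in range(i0, min(i0 + mb, n)):
                for k in range(k0, min(k0 + kb, n)):
                    aik = a[i * n + k]
                    row_b = k * n
                    row_c = i * n
                    for j in range(n):
                        c[row_c + j] += aik * b[row_b + j]
    return c


def autotune(n, mb_candidates, kb_candidates, repeats=3):
    """Time every (mb, kb) pair and return the fastest pair plus all timings."""
    rng = random.Random(0)  # fixed seed so every candidate sees the same inputs
    a = [rng.random() for _ in range(n * n)]
    b = [rng.random() for _ in range(n * n)]
    timings = {}
    for mb, kb in itertools.product(mb_candidates, kb_candidates):
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_gemm(a, b, n, mb, kb)
            best = min(best, time.perf_counter() - start)
        timings[(mb, kb)] = best  # keep the best of several runs to damp noise
    fastest = min(timings, key=timings.get)
    return fastest, timings


if __name__ == "__main__":
    fastest, timings = autotune(n=48, mb_candidates=[4, 8, 16], kb_candidates=[8, 16])
    print("fastest (mb, kb):", fastest)
```

The sketch searches the full Cartesian product of candidates. Reducing that search space, as the abstract mentions, would amount to passing shorter candidate lists chosen from prior measurements.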
Pages: 7895-7908
Number of pages: 14
Related papers
15 in total
  • [1] [Anonymous], 2014, ACM INT C SUP 25 ANN
  • [2] [Anonymous], 2018, Math kernel library
  • [3] [Anonymous], 2016, Intel Xeon Phi Processor High Performance Programming, DOI 10.1016/B978-0-12-809194-4.00022-3
  • [4] Anatomy of high-performance matrix multiplication
    Goto, Kazushige
    Van De Geijn, Robert A.
    [J]. ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2008, 34 (03):
  • [5] Gunnels J. A., 2001, Computational Science - ICCS 2001. International Conference. Proceedings, Part I (Lecture Notes in Computer Science Vol.2073), P51
  • [6] Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel® Xeon Phi™ Coprocessor
    Heinecke, Alexander
    Vaidyanathan, Karthikeyan
    Smelyanskiy, Mikhail
    Kobotov, Alexander
    Dubtsov, Roman
    Henry, Greg
    Shet, Aniruddha G.
    Chrysos, George
    Dubey, Pradeep
    [J]. IEEE 27TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2013), 2013: 126-137
  • [7] OpenMP-based parallel implementation of matrix-matrix multiplication on the Intel Knights Landing
    Lim, Roktaek
    Lee, Yeongha
    Kim, Raehyun
    Choi, Jaeyoung
    [J]. HPC ASIA'18: PROCEEDINGS OF WORKSHOPS OF HPC ASIA, 2018, : 63 - 66
  • [8] An implementation of matrix-matrix multiplication on the Intel KNL processor with AVX-512
    Lim, Roktaek
    Lee, Yeongha
    Kim, Raehyun
    Choi, Jaeyoung
    [J]. CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2018, 21 (04): 1785-1795
  • [9] Analytical Modeling Is Enough for High-Performance BLIS
    Low, Tze Meng
    Igual, Francisco D.
    Smith, Tyler M.
    Quintana-Orti, Enrique S.
    [J]. ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2016, 43 (02):
  • [10] Anatomy of High-Performance Many-Threaded Matrix Multiplication
    Smith, Tyler M.
    van de Geijn, Robert
    Smelyanskiy, Mikhail
    Hammond, Jeff R.
    Van Zee, Field G.
    [J]. 2014 IEEE 28TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM, 2014,