A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels

Cited by: 30
Authors
Valsalam, V [1 ]
Skjellum, A [1 ]
Affiliations
[1] Mississippi State Univ, Dept Comp Sci, High Performance Comp Lab, Mississippi State, MS 39762 USA
Keywords
matrix multiplication; hierarchical matrix storage; Morton order; polyalgorithms; Strassen's algorithm; kernel interface
DOI
10.1002/cpe.630
CLC number
TP31 [Computer Software]
Discipline codes
081202; 0835
Abstract
Despite extensive research, optimal matrix-multiplication performance (especially for large matrices) has remained difficult to attain on most architectures, owing to the lack of a structured approach and the limitations imposed by matrix storage formats. A simple but effective framework is presented here that lays the foundation for building high-performance matrix-multiplication codes in a structured, portable and efficient manner. The resulting codes are validated on three representative RISC and CISC architectures, on which they significantly outperform highly optimized libraries such as ATLAS and other competing methodologies reported in the literature. The main component of the proposed approach is a hierarchical storage format that efficiently generalizes the memory-hierarchy-friendly Morton ordering to arbitrary-sized matrices. The storage format supports polyalgorithms, which are shown here to be essential for obtaining the best possible performance across a range of problem sizes. Several algorithmic advances are made in this paper, including an oscillating iterative algorithm for matrix multiplication and a variable recursion cutoff criterion for Strassen's algorithm. The authors expose the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high-performance code. These kernel routines operate on small blocks that fit in the L1 cache. The performance advantages of the proposed framework can be delivered to new and existing applications through object-oriented or compiler-based approaches. Copyright (C) 2002 John Wiley & Sons, Ltd.
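The abstract's central device is Morton (Z-order) storage, which interleaves the row and column bits of an element or block index so that entries close together in the matrix stay close together in memory. A minimal sketch of that index mapping, for square matrices with power-of-two dimensions (the paper's generalization to arbitrary-sized matrices is not reproduced here, and the function names are illustrative, not from the paper):

```python
def morton_index(row: int, col: int, bits: int = 16) -> int:
    """Interleave the bits of (row, col) into a single Z-order index.

    Bit b of `col` lands at position 2*b and bit b of `row` at 2*b + 1,
    so the four quadrants of any 2x2 block occupy consecutive indices.
    """
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)
        z |= ((row >> b) & 1) << (2 * b + 1)
    return z


def to_morton(a):
    """Copy a square, power-of-two-sized matrix into a Morton-ordered buffer."""
    n = len(a)
    buf = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            buf[morton_index(i, j)] = a[i][j]
    return buf
```

Walking `buf` linearly then visits the matrix one 2x2 (and, recursively, one 2^k x 2^k) block at a time, which is what makes the layout friendly to every level of the memory hierarchy at once and a natural fit for both the recursive (Strassen) and blocked iterative algorithms the abstract describes.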
Pages: 805-839 (35 pages)