Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor

Cited by: 35
Authors
Kurzak, Jakub [1 ]
Alvaro, Wesley [1 ]
Dongarra, Jack [1 ,2 ,3 ,4 ]
Affiliations
[1] Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996 USA
[2] Oak Ridge Natl Lab, Div Math & Comp Sci, Oak Ridge, TN USA
[3] Univ Manchester, Sch Math, Manchester, England
[4] Univ Manchester, Sch Comp Sci, Manchester, England
Keywords
Instruction level parallelism; Single Instruction Multiple Data; Synergistic Processing Element; Loop optimizations; Vectorization; LINEAR-EQUATIONS; SOLVING SYSTEMS;
DOI
10.1016/j.parco.2008.12.010
Chinese Library Classification (CLC) number
TP301 [Theory and Methods];
Discipline classification code
081202 ;
Abstract
Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular value and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance, aside from special-purpose accelerators such as Graphics Processing Units (GPUs). To fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short-vector Single Instruction Multiple Data (SIMD) architecture of the Synergistic Processing Element (SPE) of the CELL processor. In this paper, single-precision matrix multiplication kernels are presented that implement the C = C - A × B^T operation and the C = C - A × B operation for matrices of size 64 × 64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures. (C) 2009 Elsevier B.V. All rights reserved.
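For orientation, the two tile operations named in the abstract can be written as plain scalar C. This is only a reference sketch of what the operations compute, not the authors' SIMD kernel: the function names `gemm_nt_sub` and `gemm_nn_sub`, the row-major layout, and the loop order are assumptions made for illustration.

```c
#include <stddef.h>

#define TILE 64 /* tile size stated in the abstract: 64 x 64 elements */

/* Hypothetical reference routine: C = C - A * B^T on one 64 x 64 tile.
 * All matrices are assumed row-major, single precision. */
void gemm_nt_sub(float *c, const float *a, const float *b)
{
    for (size_t i = 0; i < TILE; i++) {
        for (size_t j = 0; j < TILE; j++) {
            float acc = 0.0f;
            /* Row i of A dotted with row j of B (i.e. column j of B^T). */
            for (size_t k = 0; k < TILE; k++)
                acc += a[i * TILE + k] * b[j * TILE + k];
            c[i * TILE + j] -= acc;
        }
    }
}

/* Hypothetical reference routine: C = C - A * B on one 64 x 64 tile. */
void gemm_nn_sub(float *c, const float *a, const float *b)
{
    for (size_t i = 0; i < TILE; i++) {
        for (size_t j = 0; j < TILE; j++) {
            float acc = 0.0f;
            /* Row i of A dotted with column j of B. */
            for (size_t k = 0; k < TILE; k++)
                acc += a[i * TILE + k] * b[k * TILE + j];
            c[i * TILE + j] -= acc;
        }
    }
}
```

Each call performs 2 × 64^3 = 524,288 floating-point operations (one multiply and one add per inner-loop step). A scalar loop like this is useful mainly as a correctness check against an optimized kernel.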
Pages: 138-150
Number of pages: 13
Related papers
50 in total
  • [1] Fast and small short vector SIMD matrix multiplication kernels for the synergistic processing element of the CELL processor
    Alvaro, Wesley
    Kurzak, Jakub
    Dongarra, Jack
    COMPUTATIONAL SCIENCE - ICCS 2008, PT 1, 2008, 5101 : 935 - 944
  • [2] Performance of an embedded optical vector matrix multiplication processor architecture
    Yang, C.
    Cui, G. X.
    Huang, Y. Y.
    Wu, L.
    Yang, H.
    Zhang, Y. H.
    IET OPTOELECTRONICS, 2010, 4 (04) : 159 - 164
  • [3] Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures
    Henretty, Tom
    Stock, Kevin
    Pouchet, Louis-Noel
    Franchetti, Franz
    Ramanujam, J.
    Sadayappan, P.
    COMPILER CONSTRUCTION, 2011, 6601 : 225 - +
  • [4] A new parallel DSP with short-vector memory architecture
    Fridman, J
    Anderson, WC
    ICASSP '99: 1999 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS VOLS I-VI, 1999, : 2139 - 2142
  • [5] Optimizing matrix-matrix multiplication on intel's advanced vector extensions multicore processor
    Hemeida, A. M.
    Hassan, S. A.
    Alkhalaf, Salem
    Mahmoud, M. M. M.
    Saber, M. A.
    Eldin, Ayman M. Bahaa
    Senjyu, Tomonobu
    Alayed, Abdullah H.
    AIN SHAMS ENGINEERING JOURNAL, 2020, 11 (04) : 1179 - 1190
  • [6] An Embedded Optical Vector Matrix Multiplication Processor
    Fuhui-kai
    INFORMATION TECHNOLOGY APPLICATIONS IN INDUSTRY, PTS 1-4, 2013, 263-266 : 1334 - 1337
  • [7] An architecture-aware technique for optimizing sparse matrix-vector multiplication on GPUs
    Maggioni, Marco
    Berger-Wolf, Tanya
    2013 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE, 2013, 18 : 329 - 338
  • [8] Controlled optical fibre processor for matrix/vector multiplication
    Pilipovich, VA
    Esman, AK
    Goncharenko, IA
    Posedko, VS
    Solonovich, IF
    SECOND INTERNATIONAL CONFERENCE ON OPTICAL INFORMATION PROCESSING, 1996, 2969 : 125 - 128
  • [9] An efficient SIMD compression format for sparse matrix-vector multiplication
    Chen, Xinhai
    Xie, Peizhen
    Chi, Lihua
    Liu, Jie
    Gong, Chunye
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2018, 30 (23):
  • [10] SIMD Parallel Sparse Matrix-Vector and Transposed-Matrix-Vector Multiplication in DD Precision
    Hishinuma, Toshiaki
    Hasegawa, Hidehiko
    Tanaka, Teruo
    HIGH PERFORMANCE COMPUTING FOR COMPUTATIONAL SCIENCE - VECPAR 2016, 2017, 10150 : 21 - 34