Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor

Cited by: 35
Authors
Kurzak, Jakub [1]
Alvaro, Wesley [1]
Dongarra, Jack [1, 2, 3, 4]
Affiliations
[1] Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996 USA
[2] Oak Ridge Natl Lab, Div Math & Comp Sci, Oak Ridge, TN USA
[3] Univ Manchester, Sch Math, Manchester, England
[4] Univ Manchester, Sch Comp Sci, Manchester, England
Keywords
Instruction level parallelism; Single Instruction Multiple Data; Synergistic Processing Element; Loop optimizations; Vectorization; LINEAR-EQUATIONS; SOLVING SYSTEMS;
DOI
10.1016/j.parco.2008.12.010
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular value and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance, aside from special-purpose accelerators like Graphics Processing Units (GPUs). In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short-vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single-precision matrix multiplication kernels are presented implementing the C = C - A × B^T operation and the C = C - A × B operation for matrices of size 64 × 64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures. © 2009 Elsevier B.V. All rights reserved.
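For orientation, the two tile operations named in the abstract can be written as a plain scalar reference. This is only a minimal sketch of what the kernels compute, not the paper's SIMD implementation: the function names, the list-of-rows layout, and the helper functions are all illustrative assumptions. The last helper simply divides the reported rate by the reported fraction of peak to show the peak those two figures imply.

```python
# Scalar reference sketch of the operations in the abstract.
# All names and the list-of-rows tile layout are hypothetical; the paper's
# kernels are hand-tuned SIMD code and are not reproduced here.

N = 64  # tile size stated in the abstract


def gemm_sub_nt(c, a, b, n=N):
    """Update C in place: C = C - A x B^T, for n x n tiles."""
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[j][k]  # (B^T)[k][j] is B[j][k]
            c[i][j] -= acc
    return c


def gemm_sub_nn(c, a, b, n=N):
    """Update C in place: C = C - A x B, for n x n tiles."""
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]
            c[i][j] -= acc
    return c


def tile_flops(n=N):
    """Floating-point operations per tile update: one multiply and one add per (i, j, k)."""
    return 2 * n ** 3


def implied_peak_gflops(achieved_gflops, fraction_of_peak):
    """Peak rate implied by an achieved rate and its stated fraction of peak."""
    return achieved_gflops / fraction_of_peak
```

With the abstract's figures, `implied_peak_gflops(25.55, 0.9980)` comes out at roughly 25.6 Gflop/s, and `tile_flops()` gives the operation count for one 64 × 64 tile update.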
Pages: 138-150
Number of pages: 13
Related papers
50 in total
  • [31] Efficient Sparse Matrix-Vector Multiplication on Intel PIUMA Architecture
    Aananthakrishnan, Sriram
    Pawlowski, Robert
    Fryman, Joshua
    Hur, Ibrahim
    2020 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2020
  • [32] Charge-mode parallel architecture for vector-matrix multiplication
    Genov, R
    Cauwenberghs, G
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-ANALOG AND DIGITAL SIGNAL PROCESSING, 2001, 48 (10): 930-936
  • [33] The vector fixed point unit of the synergistic processor element of the cell architecture processor
    Mäding, N
    Leenstra, J
    Pille, J
    Sautter, R
    Büttner, S
    Ehrenreich, S
    Haller, W
    ESSCIRC 2005: PROCEEDINGS OF THE 31ST EUROPEAN SOLID-STATE CIRCUITS CONFERENCE, 2005: 203-206
  • [34] The Vector Fixed Point Unit of the synergistic processor element of the cell architecture processor
    Maeding, N.
    Leenstra, J.
    Pille, J.
    Sautter, R.
    Buettner, S.
    Ehrenreich, S.
    Haller, W.
    2006 DESIGN AUTOMATION AND TEST IN EUROPE, VOLS 1-3, PROCEEDINGS, 2006: 1579-+
  • [35] Optimizing Sparse Matrix-Vector Multiplication on GPUs via Index Compression
    Sun, Xue
    Wei, Kai-Cheng
    Lai, Lien-Fu
    Tsai, Sung-Han
    Wu, Chao-Chin
    PROCEEDINGS OF 2018 IEEE 3RD ADVANCED INFORMATION TECHNOLOGY, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (IAEAC 2018), 2018: 598-602
  • [36] Effective Implementation of Matrix-Vector Multiplication on Intel's AVX multicore Processor
    Hassan, Somaia A.
    Mahmoud, Mountasser M. M.
    Hemeida, A. M.
    Saber, Mahmoud A.
    COMPUTER LANGUAGES SYSTEMS & STRUCTURES, 2018, 51: 158-175
  • [37] NESTED CROSSBAR CONNECTION NETWORKS FOR OPTICALLY INTERCONNECTED PROCESSOR ARRAYS FOR VECTOR MATRIX MULTIPLICATION
    FELDMAN, MR
    GUEST, CC
    APPLIED OPTICS, 1990, 29 (08): 1068-1076
  • [38] Optimizing Matrix Multiplication on Intel® Xeon Phi™ x200 Architecture
    Guney, Murat E.
    Goto, Kazushige
    Costa, Timothy B.
    Knepper, Sarah
    Huot, Louise
    Mitrano, Arthur A.
    Story, Shane
    2017 IEEE 24TH SYMPOSIUM ON COMPUTER ARITHMETIC (ARITH), 2017: 144-145
  • [39] Merge-based Parallel Sparse Matrix-Sparse Vector Multiplication with a Vector Architecture
    Li, Haoran
    Yokoyama, Harumichi
    Araki, Takuya
    IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS / IEEE 16TH INTERNATIONAL CONFERENCE ON SMART CITY / IEEE 4TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS), 2018: 43-50
  • [40] Digital in-memory stochastic computing architecture for vector-matrix multiplication
    Agwa, Shady
    Prodromakis, Themis
    FRONTIERS IN NANOTECHNOLOGY, 2023, 5