Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor

Cited by: 35

Authors:

Kurzak, Jakub [1]

Alvaro, Wesley [1]

Dongarra, Jack [1,2,3,4]

Affiliations:

[1] Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996 USA

[2] Oak Ridge Natl Lab, Div Math & Comp Sci, Oak Ridge, TN USA

[3] Univ Manchester, Sch Math, Manchester, England

[4] Univ Manchester, Sch Comp Sci, Manchester, England

Source:

PARALLEL COMPUTING | 2009 / Vol. 35 / Issue 03

Keywords:

Instruction level parallelism; Single Instruction Multiple Data; Synergistic Processing Element; Loop optimizations; Vectorization; LINEAR-EQUATIONS; SOLVING SYSTEMS;

DOI:

10.1016/j.parco.2008.12.010

Chinese Library Classification number:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular value and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single precision floating point performance, aside from special purpose accelerators like Graphics Processing Units (GPUs). In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single precision matrix multiplication kernels are presented implementing the C = C - A × B^T operation and the C = C - A × B operation for matrices of size 64 × 64 elements. For the latter case, the performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures. (C) 2009 Elsevier B.V. All rights reserved.
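As a rough illustration of what the two kernels in the abstract compute, the following is a minimal, unoptimized reference sketch in plain Python. It is not the authors' SIMD kernel: the function names (`gemm_nt_sub`, `gemm_nn_sub`, `flops_per_tile`, `implied_peak_gflops`) and the row-major list-of-lists layout are assumptions made here for illustration. The last helper only back-computes the peak rate implied by the two figures quoted in the abstract (25.55 Gflop/s at 99.80% of peak).

```python
# Reference (scalar, unoptimized) versions of the two tile operations named
# in the abstract. All names and the data layout are illustrative assumptions.

N = 64  # tile size stated in the abstract


def gemm_nt_sub(C, A, B, n=N):
    """In place: C = C - A x B^T, for n-by-n row-major lists of lists."""
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                # B is accessed by rows here, which is what the transpose means.
                acc += A[i][k] * B[j][k]
            C[i][j] -= acc


def gemm_nn_sub(C, A, B, n=N):
    """In place: C = C - A x B, for n-by-n row-major lists of lists."""
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[i][k] * B[k][j]
            C[i][j] -= acc


def flops_per_tile(n=N):
    """Floating point operations per tile update: one multiply and one add per (i, j, k)."""
    return 2 * n ** 3


def implied_peak_gflops(reported=25.55, fraction=0.9980):
    """Peak rate implied by the reported rate and the reported fraction of peak."""
    return reported / fraction
```

Under these assumptions, one 64 × 64 tile update costs 2 × 64^3 = 524288 floating point operations, and the abstract's figures imply a peak of roughly 25.6 Gflop/s for the processing element the kernel runs on.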

Pages: 138-150

Number of pages: 13
