Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor

Cited by: 35

Authors:

Kurzak, Jakub [1]

Alvaro, Wesley [1]

Dongarra, Jack [1,2,3,4]

Affiliations:

[1] Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996 USA

[2] Oak Ridge Natl Lab, Div Math & Comp Sci, Oak Ridge, TN USA

[3] Univ Manchester, Sch Math, Manchester, England

[4] Univ Manchester, Sch Comp Sci, Manchester, England

Source:

PARALLEL COMPUTING | 2009 / Vol. 35 / Issue 03

Keywords:

Instruction level parallelism; Single Instruction Multiple Data; Synergistic Processing Element; Loop optimizations; Vectorization; LINEAR-EQUATIONS; SOLVING SYSTEMS;

DOI:

10.1016/j.parco.2008.12.010

Chinese Library Classification (CLC) number:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular value and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance, aside from special-purpose accelerators like Graphics Processing Units (GPUs). In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short-vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single-precision matrix multiplication kernels are presented implementing the C = C - A × B^T operation and the C = C - A × B operation for matrices of size 64 × 64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures. (C) 2009 Elsevier B.V. All rights reserved.
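To make the two operations named in the abstract concrete, the following is a minimal scalar reference sketch in C, not the paper's SIMD kernels. The function names (`gemm_nt_ref`, `gemm_nn_ref`, `tile_flops`, `spe_peak_gflops`, `check_kernels`) and the row-major layout are assumptions introduced here for illustration. The peak figure assumes a 3.2 GHz SPE issuing one 4-wide single-precision fused multiply-add per cycle; the abstract itself does not state the clock rate.

```c
#include <stddef.h>

#define N 64 /* tile size stated in the abstract */

/* Scalar reference for C = C - A * B^T on row-major N x N tiles (assumed layout). */
static void gemm_nt_ref(float *c, const float *a, const float *b)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; k++)
                acc += a[i * N + k] * b[j * N + k]; /* row j of B = column j of B^T */
            c[i * N + j] -= acc;
        }
}

/* Scalar reference for C = C - A * B on row-major N x N tiles (assumed layout). */
static void gemm_nn_ref(float *c, const float *a, const float *b)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; k++)
                acc += a[i * N + k] * b[k * N + j];
            c[i * N + j] -= acc;
        }
}

/* Floating-point operations in one tile update: N^3 multiplies + N^3 adds. */
static double tile_flops(void)
{
    return 2.0 * N * N * N;
}

/* Assumed SPE peak: clock (GHz) * 4 SIMD lanes * 2 flops per fused multiply-add. */
static double spe_peak_gflops(double clock_ghz)
{
    return clock_ghz * 4.0 * 2.0;
}

/* Sanity check with A = identity and C = 0: expect C = -B^T and C = -B. */
static int check_kernels(void)
{
    static float a[N * N], b[N * N], c_nt[N * N], c_nn[N * N];
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            a[i * N + j] = (i == j) ? 1.0f : 0.0f;
            b[i * N + j] = (float)((int)i - (int)j); /* small integers, exact in float */
            c_nt[i * N + j] = 0.0f;
            c_nn[i * N + j] = 0.0f;
        }
    gemm_nt_ref(c_nt, a, b);
    gemm_nn_ref(c_nn, a, b);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            if (c_nt[i * N + j] != -b[j * N + i] || c_nn[i * N + j] != -b[i * N + j])
                return 0;
    return 1;
}
```

Under the assumed 3.2 GHz clock, `spe_peak_gflops(3.2)` gives 25.6 Gflop/s, and 25.55 / 25.6 is roughly 0.998, which is consistent with the 99.80% of peak reported in the abstract. A production kernel would replace the scalar loops with vectorized, unrolled code for the SPE.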


Pages: 138-150

Number of pages: 13

