HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs

Cited by: 6
Authors
Kang, Homin [1 ]
Kwon, Hyuck Chan [1 ]
Kim, Duksu [1 ]
Affiliations
[1] Korea University of Technology and Education (KOREATECH), Cheonan, South Korea
Funding
National Research Foundation of Singapore
Keywords
Matrix multiplication; Parallel algorithm; Heterogeneous; GPU; Strassen
DOI
10.1007/s00607-020-00846-1
CLC number
TP301 [Theory and methods]
Discipline code
081202
Abstract
We present a novel heterogeneous parallel matrix multiplication algorithm that utilizes both central processing units (CPUs) and graphics processing units (GPUs) for large-scale matrices. Based on Strassen's method, we represent the matrix multiplication work as a set of matrix addition and multiplication tasks among sub-matrices. We then distribute the tasks to CPUs and GPUs while considering the characteristics of the tasks and the computing resources, in order to minimize the data communication overhead and fully utilize the available computing power. To handle a large matrix efficiently with limited GPU memory, we also propose a block-based work decomposition method. We further improve the performance of our method by exploiting the concurrent execution abilities of a heterogeneous parallel computing system. We implemented our method on five different heterogeneous systems and applied it to matrices of various sizes. Our method generally shows higher performance than prior GPU-based matrix multiplication methods. Moreover, compared with the state-of-the-art GPU matrix multiplication library (i.e., CUBLAS), our method achieved up to 1.97 times higher performance using the same GPUs and CPU cores. In some cases, our method using a low-performance GPU (e.g., GTX 1060, 3 GB) achieved performance comparable to that of CUBLAS using a high-performance GPU (e.g., RTX 2080, 8 GB). Our method also continually improves performance as more computing resources, such as additional CPU cores and GPUs, are used. We could achieve such high performance because our approach fully utilizes the capacities of the given heterogeneous parallel computing systems while employing Strassen's method, which has a lower asymptotic complexity. These results demonstrate the efficiency and robustness of our algorithm.
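Note: the Strassen-based task decomposition that the abstract refers to can be illustrated with a short sketch. The following is a minimal NumPy version of one recursion level of the classical Strassen scheme, not the authors' HPMaX implementation; the function name, the even-dimension assumption, and the usage example are illustrative. Each of the seven sub-products M1 through M7 is an independent multiplication task preceded by addition tasks, which is what makes the CPU/GPU task distribution described above possible.

import numpy as np

def strassen_one_level(A, B):
    """Multiply A @ B using Strassen's seven sub-products (one recursion level).

    Assumes square matrices with an even dimension; a full implementation
    would recurse and pad odd sizes.
    """
    n = A.shape[0]
    h = n // 2
    # Partition each operand into four h x h sub-matrices.
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven sub-multiplications (instead of eight), each preceded by
    # sub-matrix additions; these independent tasks are the units a
    # heterogeneous scheduler can distribute across CPUs and GPUs.
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    # Recombine the seven products into the four quadrants of C = A @ B.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((256, 256))
    B = rng.random((256, 256))
    assert np.allclose(strassen_one_level(A, B), A @ B)

One level already replaces eight sub-multiplications with seven; applied recursively, this yields Strassen's O(n^2.81) asymptotic complexity, the "lower asymptotic complexity" the abstract mentions.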
Pages: 2607-2631
Page count: 25