Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture

Cited by: 0
Authors
Donglin Chen
Jianbin Fang
Shizhao Chen
Chuanfu Xu
Zheng Wang
Affiliations
[1] National University of Defense Technology, College of Computer Science
[2] Lancaster University, School of Computing and Communications
Source
International Journal of Parallel Programming | 2019 / Vol. 47
Keywords
SpMV; Sparse matrix format; Many-core; Performance tuning
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
Sparse matrix–vector multiplications (SpMV) are common in scientific and HPC applications but are hard to optimize. While the ARMv8-based processor IP is emerging as an alternative to the traditional x64 HPC processor design, there has been little study of SpMV performance on such new many-cores. To design efficient HPC software and hardware, we need to understand how well SpMV performs. This work develops a quantitative approach to characterize SpMV performance on a recent ARMv8-based many-core architecture, Phytium FT-2000 Plus (FTP). We perform extensive experiments involving over 9500 distinct profiling runs on 956 sparse datasets and five mainstream sparse matrix storage formats, and compare FTP against the Intel Knights Landing many-core. We experimentally show that picking the optimal sparse matrix storage format and parameters is non-trivial, as the correct decision requires expert knowledge of the input matrix and the hardware. We address the problem by proposing a machine-learning-based model that predicts the best storage format and parameters using input matrix features. The model automatically specializes to the many-core architectures we considered. The experimental results show that our approach achieves on average 93% of the best-available performance without incurring runtime profiling overhead.
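For readers unfamiliar with the operation, the following is a minimal sketch of SpMV over one widely used storage format, compressed sparse row (CSR). The abstract does not list the five formats evaluated, so the choice of CSR here is an assumption made for illustration, and the function name `spmv_csr` and the sample matrix are hypothetical rather than taken from the paper.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A * x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i + 1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Only the stored nonzeros of row i are visited.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is accessed indirectly through col_idx, which is the
            # irregular memory access that makes SpMV hard to optimize.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Illustrative 3x3 matrix (not from the paper):
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)
```

Other storage formats lay out the same nonzeros differently, which changes the memory access pattern and the work per row; this is why the best format depends on both the input matrix and the hardware.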
Pages: 418-432
Page count: 14