Research on Optimal Performance of Sparse Matrix-Vector Multiplication and Convolution Using the Probability-Process-Ram Model

Cited by: 0
Authors
Xie Z. [1,2,3]
Tan G. [1,2]
Sun N. [1,2]
Affiliations
[1] State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
[2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing
[3] School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing
Source
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2021 / Vol. 58 / No. 3
Funding
National Natural Science Foundation of China
Keywords
Cache simulator; Convolution; Feedback optimization; Performance model; Sparse matrix-vector multiplication
DOI
10.7544/issn1000-1239.2021.20180601
Abstract
Performance models provide insightful perspectives that allow us to predict performance and derive optimization guidance. Despite much prior research, pinpointing the bottlenecks of diverse memory access patterns and achieving high performance for both regular and irregular programs across hardware configurations remain nontrivial. In this work, we propose a novel model, probability-process-ram (PPR), to quantify compute time and data-transfer time on general-purpose multicore processors. The PPR model predicts the single-core instruction count and the probability of memory accesses between each level of the memory hierarchy through a newly designed cache simulator. Using the automatically extracted best optimization method and its expected benefit, we apply the PPR model to analyze and optimize sparse matrix-vector multiplication and 1D convolution, as case studies of typical irregular and regular computational kernels. We then obtain the best block sizes for sparse matrices with various sparsity structures, as well as optimization guidance for 1D convolution under different instruction-set support and data sizes. Compared with the Roofline and ECM models, the proposed PPR model greatly improves prediction accuracy through its newly designed cache simulator and provides comprehensive feedback for optimization. © 2021, Science Press. All rights reserved.
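To make the abstract's cost decomposition concrete, below is a minimal, hypothetical Python sketch of a PPR-style estimate: per-level hit probabilities (as the paper's cache simulator would produce) weight per-level access costs, and the resulting memory term is combined with an instruction-throughput term. The function name, parameters, latency values, and the max-based overlap assumption are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a PPR-style time estimate (not the authors' code).
# A cache simulator is assumed to supply the probability that a memory
# access is served by each level of the hierarchy (L1, L2, L3, DRAM).

def ppr_estimate(n_instructions, ipc, freq_ghz,
                 n_accesses, level_probs, level_costs_ns):
    """Return a rough single-core execution-time estimate in nanoseconds."""
    assert abs(sum(level_probs) - 1.0) < 1e-9  # probabilities must sum to 1
    # Compute term: instruction count over sustained instruction throughput.
    compute_ns = n_instructions / (ipc * freq_ghz)
    # Memory term: expected cost of one access, weighted by hit probability.
    expected_access_ns = sum(p * c for p, c in zip(level_probs, level_costs_ns))
    memory_ns = n_accesses * expected_access_ns
    # Assume compute and data transfer overlap fully; the slower one dominates.
    return max(compute_ns, memory_ns)

# Example: 1e6 instructions at IPC 2 on a 3 GHz core; 2e5 accesses with
# assumed hit probabilities of 90% L1, 6% L2, 3% L3, 1% DRAM.
t = ppr_estimate(1e6, 2.0, 3.0, 2e5,
                 [0.90, 0.06, 0.03, 0.01], [1.3, 4.0, 12.0, 60.0])
print(f"predicted time: {t:.0f} ns")
```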
Pages: 445-457 (12 pages)
References (18 in total)
  • [1] Press W H, Teukolsky S A., Biconjugate gradient method for sparse linear systems, Computers in Physics, 6, 4, pp. 400-410, (1992)
  • [2] Li Kenli, Yang Wangdong, Li Keqin, Performance analysis and optimization for SpMV on GPU using probabilistic modeling, IEEE Transactions on Parallel and Distributed Systems, 26, 1, pp. 196-205, (2015)
  • [3] Li Shigang, Hu Changjun, Zhang Junchao, et al., Automatic tuning of sparse matrix-vector multiplication on multicore clusters, Science China Information Sciences, 58, 9, pp. 1-14, (2015)
  • [4] Zitova B, Flusser J., Image registration methods: A survey, Image and Vision Computing, 21, 11, pp. 977-1000, (2003)
  • [5] Qadeer W, Hameed R, Shacham O, et al., Convolution engine: Balancing efficiency and flexibility in specialized computing, Communications of the ACM, 41, 3, pp. 24-35, (2013)
  • [6] Uhl A., Wavelet packet best basis selection on moderate parallel MIMD architectures, Parallel Computing, 22, 1, pp. 149-158, (1996)
  • [7] Chakrabarti C, Vishwanath M., Efficient realizations of the discrete and continuous wavelet transforms: From single chip implementations to mappings on SIMD array computers, IEEE Transactions on Signal Processing, 43, 3, pp. 759-771, (1995)
  • [8] Konstantinidis E, Cotronis Y., A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling, Journal of Parallel and Distributed Computing, 107, 1, pp. 37-56, (2017)
  • [9] Ilic A, Pratas F, Sousa L., Cache-aware roofline model: Upgrading the loft, IEEE Computer Architecture Letters, 13, 1, pp. 21-24, (2013)
  • [10] Stengel H, Treibig J, Hager G, et al., Quantifying performance bottlenecks of stencil computations using the execution-cache-memory model, Proc of the 29th ACM on Int Conf on Supercomputing, pp. 207-216, (2015)