JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication

Cited by: 0
Authors
Fu, Qiang [1 ,3 ]
Rolinger, Thomas B. [2 ,4 ]
Huang, H. Howie [3 ]
Affiliations
[1] Adv Micro Devices Inc, Austin, TX 95054 USA
[2] NVIDIA, Austin, TX USA
[3] George Washington Univ, Washington, DC USA
[4] Lab Phys Sci, Austin, TX USA
Source
2024 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, CGO | 2024
Funding
U.S. National Science Foundation;
Keywords
SpMM; Just-in-Time Instruction Generation; Performance Profiling; Performance Optimization;
DOI
10.1109/CGO57630.2024.10444827
CLC Number
TP3 [Computing technology, computer technology];
Discipline Code
0812;
Abstract
Achieving high performance for Sparse Matrix-Matrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data size in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the ahead-of-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory access, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JITSPMM, a just-in-time (JIT) assembly code generation framework to accelerate SpMM computation on multi-core CPUs with SIMD extensions. First, JITSPMM integrates the JIT assembly code generation technique into three widely-used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JITSPMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JITSPMM intelligently allocates registers to cache frequently accessed data to minimize memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JITSPMM and compare it to two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization. The second utilizes the highly-optimized SpMM routine provided by Intel MKL. Our results show that JITSPMM provides an average improvement of 3.8x and 1.4x over these baselines, respectively.
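To ground the abstract's discussion, a minimal AOT-style CSR-based SpMM kernel might look like the sketch below. The function name, CSR field names, and example data are illustrative assumptions, not taken from the paper; the inner loops show where the memory-access, branching, and instruction overheads arise that a runtime-specialized (JIT) kernel can eliminate once the sparsity pattern is known.

```python
def spmm_csr(indptr, indices, data, B, num_cols_B):
    """Compute C = A @ B where A is sparse in CSR form and B is a dense
    row-major matrix (list of rows). Illustrative sketch, not the paper's code.

    indptr[i]..indptr[i+1] delimits row i's nonzeros; indices/data hold
    their column positions and values. The bounds of the inner loops are
    only known at runtime, which is what AOT compilation cannot exploit.
    """
    num_rows = len(indptr) - 1
    C = [[0.0] * num_cols_B for _ in range(num_rows)]
    for i in range(num_rows):
        # Loop over the nonzeros of row i of A.
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]      # column of the nonzero (a runtime-dependent load)
            a = data[p]
            # Scale row j of B and accumulate into row i of C; this is the
            # performance-critical loop that JIT code generation can unroll
            # and vectorize for the actual row lengths seen at runtime.
            for k in range(num_cols_B):
                C[i][k] += a * B[j][k]
    return C
```

For example, with A = [[1, 0], [2, 3]] stored as indptr=[0, 1, 3], indices=[0, 0, 1], data=[1.0, 2.0, 3.0], multiplying by B = [[1, 2], [3, 4]] yields [[1, 2], [11, 16]].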
Pages: 448 / 459
Page count: 12