JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication

Cited by: 0
Authors
Fu, Qiang [1 ,3 ]
Rolinger, Thomas B. [2 ,4 ]
Huang, H. Howie [3 ]
Affiliations
[1] Adv Micro Devices Inc, Austin, TX 95054 USA
[2] NVIDIA, Austin, TX USA
[3] George Washington Univ, Washington, DC USA
[4] Lab Phys Sci, Austin, TX USA
Source
2024 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, CGO | 2024
Funding
U.S. National Science Foundation;
Keywords
SpMM; Just-in-Time Instruction Generation; Performance Profiling; Performance Optimization;
DOI
10.1109/CGO57630.2024.10444827
CLC Number
TP3 [Computing technology, computer technology];
Discipline Code
0812;
Abstract
Achieving high performance for Sparse Matrix-Matrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data size in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the ahead-of-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory access, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JITSPMM, a just-in-time (JIT) assembly code generation framework to accelerate SpMM computation on multi-core CPUs with SIMD extensions. First, JITSPMM integrates the JIT assembly code generation technique into three widely-used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JITSPMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JITSPMM intelligently allocates registers to cache frequently accessed data to minimize memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JITSPMM and compare it to two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization. The second utilizes the highly-optimized SpMM routine provided by Intel MKL. Our results show that JITSPMM provides an average improvement of 3.8x and 1.4x over these baselines, respectively.
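To ground the abstract's discussion, a minimal AOT-style CSR-based SpMM kernel might look like the sketch below. The function name, CSR field names, and example data are illustrative assumptions, not taken from the paper; the inner loops show where the memory-access, branching, and instruction overheads arise that a runtime-specialized (JIT) kernel can eliminate once the sparsity pattern is known.

```python
def spmm_csr(indptr, indices, data, B, num_cols_B):
    """Compute C = A @ B where A is sparse in CSR form and B is a dense
    row-major matrix (list of rows). Illustrative sketch, not the paper's code.

    indptr[i]..indptr[i+1] delimits row i's nonzeros; indices/data hold
    their column positions and values. The bounds of the inner loops are
    only known at runtime, which is what AOT compilation cannot exploit.
    """
    num_rows = len(indptr) - 1
    C = [[0.0] * num_cols_B for _ in range(num_rows)]
    for i in range(num_rows):
        # Loop over the nonzeros of row i of A.
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]      # column of the nonzero (a runtime-dependent load)
            a = data[p]
            # Scale row j of B and accumulate into row i of C; this is the
            # performance-critical loop that JIT code generation can unroll
            # and vectorize for the actual row lengths seen at runtime.
            for k in range(num_cols_B):
                C[i][k] += a * B[j][k]
    return C
```

For example, with A = [[1, 0], [2, 3]] stored as indptr=[0, 1, 3], indices=[0, 0, 1], data=[1.0, 2.0, 3.0], multiplying by B = [[1, 2], [3, 4]] yields [[1, 2], [11, 16]].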
Pages: 448 / 459
Page count: 12