VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores

Cited by: 6
Authors
Castro, Roberto L. [1 ]
Ivanov, Andrei [2 ]
Andrade, Diego [1 ]
Ben-Nun, Tal [2 ]
Fraguela, Basilio B. [1 ]
Hoefler, Torsten [2 ]
Affiliations
[1] Univ A Coruna, CITIC, La Coruna, Spain
[2] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland
Source
SC23: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2023
Keywords
Neural Networks; Pruning; GPGPU; CUDA; Sparse Tensor Cores;
DOI
10.1145/3581784.3607087
CLC Classification Code
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The increasing success and scaling of Deep Learning models demand higher computational efficiency and power. Sparsification can lead to both smaller models and higher compute efficiency, and accelerated hardware is becoming available. However, exploiting it efficiently requires kernel implementations, pruning algorithms, and storage formats that utilize the hardware support of specialized sparse vector units. An example is NVIDIA's Sparse Tensor Cores (SPTCs), which promise a 2x speedup. However, SPTCs only support the 2:4 format, limiting achievable sparsity ratios to 50%. We present the V:N:M format, which enables the execution of arbitrary N:M ratios on SPTCs. To efficiently exploit the resulting format, we propose Spatha, a high-performance sparse library for DL routines. We show that Spatha achieves up to 37x speedup over cuBLAS. We also demonstrate a second-order pruning technique that enables sparsification to high sparsity ratios with V:N:M and little to no loss in accuracy in modern transformers.
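To illustrate the idea behind a vectorized N:M pattern, the following is a minimal NumPy sketch of magnitude-based V:N:M-style pruning: within each block of V rows by M columns, 4 shared columns are retained and each row keeps N of those 4 values, so the compressed block remains a 2:4-friendly dense pattern while overall sparsity follows N:M. The function name venom_prune, the column-norm selection heuristic, and the default parameters are illustrative assumptions for this sketch, not the paper's actual pruning algorithm, storage layout, or Spatha's API.

# Illustrative sketch of V:N:M-style magnitude pruning (assumption-laden;
# Spatha's real kernels, encoding, and pruning procedure differ).
import numpy as np

def venom_prune(W, V=64, N=2, M=8):
    """Zero out W so every V x M block keeps N nonzeros per row,
    restricted to 4 shared columns per block (2:4 inside the block)."""
    R, C = W.shape
    assert R % V == 0 and C % M == 0 and N <= 4 <= M
    Wp = np.zeros_like(W)
    for r0 in range(0, R, V):
        for c0 in range(0, C, M):
            blk = W[r0:r0 + V, c0:c0 + M]
            # 1) keep the 4 columns with the largest L2 norm in this block
            cols = np.argsort(-np.linalg.norm(blk, axis=0))[:4]
            sub = blk[:, cols]                      # V x 4 dense sub-block
            # 2) within those columns, keep the N largest-magnitude entries per row
            keep = np.argsort(-np.abs(sub), axis=1)[:, :N]
            mask = np.zeros_like(sub, dtype=bool)
            np.put_along_axis(mask, keep, True, axis=1)
            Wp[r0:r0 + V, c0 + cols] = sub * mask
    return Wp

With V=64, N=2, M=8, for example, this yields 2:8 (75%) overall sparsity, while every stored V x 4 sub-block still forms a 2:4 pattern of the kind SPTCs can execute, which is the property the abstract attributes to the V:N:M format.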
Pages: 16