High Performance Householder QR Factorization on Emerging GPU Architectures Using Tensor Cores

Cited by: 0
Authors
Leng, Yuhan [1]
Zou, Gaoyuan [1]
Wang, Hansheng [1]
Wu, Panruo [2]
Zhang, Shaoshuai [1]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610054, Peoples R China
[2] Univ Houston, Dept Comp Sci, Houston, TX 77004 USA
Keywords
Tensors; Numerical analysis; Vectors; Graphics processing units; Sparse matrices; Parallel processing; Matrix decomposition; Reflection; Computer architecture; Accuracy; HPC; GPGPU; numerical linear algebra; tensor cores; mixed-precision algorithms; rounding error analysis; Cholesky QR
DOI
10.1109/TPDS.2024.3522776
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Since 2017, NVIDIA GPUs have been equipped with specialized units known as Tensor Cores, which process matrix multiplications (GEMMs) with remarkable efficiency. Beyond GEMMs, researchers have explored the use of Tensor Cores in matrix factorizations such as QR factorization. However, the GEMMs inside QR factorization are typically tall and skinny. Unlike compute-bound square GEMMs, these tall and skinny GEMMs are memory bound, which leads to suboptimal performance on Tensor Cores. To solve this problem, we show that recursive QR factorization can convert the tall and skinny GEMMs into relatively large, square GEMMs, resulting in better performance on Tensor Cores. In addition, we extend the FP16 Tensor Core-based QR factorization to accommodate FP32 and FP64 on FP16 and INT8 Tensor Cores, respectively. Furthermore, to address the loss of orthogonality in the preceding Tensor Core-based QR factorization, we switch from the Gram-Schmidt algorithm to the Householder algorithm while preserving high performance. According to our experimental evaluation on NVIDIA's A100 and GeForce RTX 3090 GPUs, our FP64, FP32, and FP16 implementations are up to 6.22x, 8.67x, and 4.03x faster, respectively, than the current state-of-the-art implementations.
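The abstract's contrast between memory-bound tall and skinny GEMMs and compute-bound square GEMMs can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from the paper: the function name `gemm_intensity`, the matrix shapes, and the simplifying assumption that each of A, B, and C crosses the memory bus exactly once in FP16 (2 bytes per element) are all illustrative choices.

```python
def gemm_intensity(m, n, k, bytes_per_element=2):
    """Estimate flops per byte for C (m x n) = A (m x k) @ B (k x n).

    Assumption (illustrative only): A and B are each read once and C is
    written once, with no cache reuse effects modeled.
    """
    flops = 2 * m * n * k                      # one multiply + one add per term
    elements_moved = m * k + k * n + m * n     # A, B, and C
    return flops / (elements_moved * bytes_per_element)


# Hypothetical shapes: a panel-style tall and skinny GEMM vs. a square GEMM.
tall_skinny = gemm_intensity(m=32768, n=32, k=32)
square = gemm_intensity(m=4096, n=4096, k=4096)

print(f"tall and skinny: {tall_skinny:.1f} flops/byte")
print(f"square:          {square:.1f} flops/byte")
```

Under these assumptions the square GEMM performs far more arithmetic per byte moved than the tall and skinny one, which is the gap that converting to larger, squarer GEMMs is meant to close.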
Pages: 422-436
Number of pages: 15