High Performance Householder QR Factorization on Emerging GPU Architectures Using Tensor Cores

Cited by: 0

Authors:

Leng, Yuhan [1]

Zou, Gaoyuan [1]

Wang, Hansheng [1]

Wu, Panruo [2]

Zhang, Shaoshuai [1]

Affiliations:

[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610054, Peoples R China

[2] Univ Houston, Dept Comp Sci, Houston, TX 77004 USA

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2025 / Vol. 36 / No. 03

Keywords:

Tensors; Numerical analysis; Vectors; Graphics processing units; Sparse matrices; Parallel processing; Matrix decomposition; Reflection; Computer architecture; Accuracy; HPC; GPGPU; numerical linear algebra; tensor cores; mixed-precision algorithms; ROUNDING ERROR ANALYSIS; CHOLESKY QR;

DOI:

10.1109/TPDS.2024.3522776

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Since 2017, NVIDIA GPUs have been equipped with specialized units known as Tensor Cores, which process matrix multiplications (GEMMs) with remarkable efficiency. Beyond GEMMs, researchers have explored applying Tensor Cores to matrix factorizations such as QR factorization. However, the GEMMs inside QR factorization are typically tall and skinny. Unlike compute-bound square GEMMs, these tall-and-skinny GEMMs are memory-bound, which leads to suboptimal performance on Tensor Cores. To solve this problem, we show that recursive QR factorization can convert the tall-and-skinny GEMMs into relatively large, square GEMMs, resulting in better performance on Tensor Cores. In addition, we extend the FP16 Tensor-Core-based QR factorization to accommodate FP32 and FP64 on FP16 and INT8 Tensor Cores, respectively. Finally, to address the loss of orthogonality in the preceding Tensor-Core-based QR factorization, we transition from the Gram-Schmidt algorithm to the Householder algorithm while preserving high performance. In our experimental evaluation on NVIDIA's A100 and GeForce RTX 3090 GPUs, the FP64, FP32, and FP16 implementations are up to 6.22x, 8.67x, and 4.03x faster, respectively, than the current state-of-the-art implementations.
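The recursion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it runs on the CPU in double precision, it uses a projection-style (block Gram-Schmidt) update rather than the Householder formulation the authors adopt, and the function name `recursive_qr` and the parameter `base_cols` are hypothetical. It only shows how splitting the columns in half turns the narrow panel updates into GEMMs whose shared dimension is half the current column count.

```python
import numpy as np

def recursive_qr(a, base_cols=2):
    """Recursive QR of a tall matrix a (m x n, m >= n); returns (q, r) with a ~= q @ r.

    Illustrative sketch only (hypothetical helper, not the paper's algorithm).
    """
    m, n = a.shape
    if n <= base_cols:
        # Base case: a narrow panel, delegated to NumPy's LAPACK-backed QR.
        return np.linalg.qr(a, mode="reduced")
    half = n // 2
    # Factor the left half of the columns.
    q1, r11 = recursive_qr(a[:, :half], base_cols)
    # Two GEMMs over `half` columns at once instead of a narrow panel:
    r12 = q1.T @ a[:, half:]            # (half x m) times (m x (n - half))
    trailing = a[:, half:] - q1 @ r12   # (m x half) times (half x (n - half))
    # Factor the updated right half.
    q2, r22 = recursive_qr(trailing, base_cols)
    q = np.hstack([q1, q2])
    r = np.block([[r11, r12],
                  [np.zeros((n - half, half)), r22]])
    return q, r

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 8))
q, r = recursive_qr(a)
print(np.allclose(q @ r, a))
```

At the top level of the recursion the two matrix products involve n/2 columns on each side, which is the sense in which the recursive form yields larger and more nearly square GEMMs than a fixed narrow panel width.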


Pages: 422-436

Number of pages: 15

References: 48 in total

