Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices

Cited by: 4

Authors:

Gates, Mark [1]

Kurzak, Jakub [1]

Luszczek, Piotr [1]

Pei, Yu [1]

Dongarra, Jack [2, 3, 4]

Affiliations:

[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA

[2] Univ Tennessee, Knoxville, TN 37996 USA

[3] Oak Ridge Natl Lab, Oak Ridge, TN USA

[4] Univ Manchester, Manchester, Lancs, England

Source:

2017 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW) | 2017

Funding:

U.S. National Science Foundation

Keywords:

batch computation; GPU computing; numerical linear algebra; Cholesky factorization; data layout;

DOI:

10.1109/IPDPSW.2017.18

Chinese Library Classification:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Batch matrix operations address the case of solving the same linear algebra problem for a very large number of very small matrices. In this paper, we focus on implementing the batch Cholesky factorization in CUDA, in single precision arithmetic, for NVIDIA GPUs. Specifically, we look into the benefits of using noncanonical data layouts, where consecutive memory locations store elements with the same row and column index in a set of consecutive matrices. We discuss a number of different implementation options and tuning parameters. We demonstrate superior performance to traditional implementations for the case of very small matrices.
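The interleaved layout described in the abstract can be illustrated with a minimal sketch. This is not the authors' CUDA kernel: it is plain Python in double precision, and the function names (`interleaved_index`, `to_interleaved`, `batch_cholesky_interleaved`) and the column-major ordering of the (row, column) pairs are assumptions made for illustration. The innermost loop over the batch index `k` stands in for the GPU threads, which would read consecutive memory locations in this layout.

```python
import math

def interleaved_index(i, j, k, n, batch):
    """Offset of element (i, j) of matrix k; the batch index k varies fastest.

    Hypothetical helper: (i, j) pairs are ordered column-major by assumption.
    """
    return (j * n + i) * batch + k

def to_interleaved(mats):
    """Pack a list of n x n matrices (nested lists) into one flat interleaved list."""
    batch, n = len(mats), len(mats[0])
    flat = [0.0] * (n * n * batch)
    for k, a in enumerate(mats):
        for j in range(n):
            for i in range(n):
                flat[interleaved_index(i, j, k, n, batch)] = a[i][j]
    return flat

def batch_cholesky_interleaved(flat, n, batch):
    """Overwrite the lower triangle of every matrix with its Cholesky factor L.

    The strictly upper triangle is left untouched. Each loop over k touches
    `batch` consecutive entries, mimicking one thread per matrix.
    """
    def idx(i, j, k):
        return interleaved_index(i, j, k, n, batch)

    for j in range(n):
        # Diagonal entry: L[j][j] = sqrt(A[j][j] - sum_p L[j][p]^2)
        for k in range(batch):
            d = flat[idx(j, j, k)]
            for p in range(j):
                d -= flat[idx(j, p, k)] ** 2
            flat[idx(j, j, k)] = math.sqrt(d)
        # Entries below the diagonal in column j
        for i in range(j + 1, n):
            for k in range(batch):
                s = flat[idx(i, j, k)]
                for p in range(j):
                    s -= flat[idx(i, p, k)] * flat[idx(j, p, k)]
                flat[idx(i, j, k)] = s / flat[idx(j, j, k)]
    return flat
```

For a batch of two 2 x 2 matrices, `to_interleaved` stores both (0, 0) entries first, then both (1, 0) entries, and so on, so that element (i, j) of neighbouring matrices sits in neighbouring memory locations.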

Pages: 1408-1417

Page count: 10

