heFFTe: Highly Efficient FFT for Exascale

Cited by: 29
Authors
Ayala, Alan [1]
Tomov, Stanimire [1]
Haidar, Azzam [2]
Dongarra, Jack [1,3,4]
Affiliations
[1] University of Tennessee, Innovative Computing Laboratory, Knoxville, TN 37916, USA
[2] NVIDIA Corporation, Santa Clara, CA, USA
[3] Oak Ridge National Laboratory, Oak Ridge, TN, USA
[4] University of Manchester, Manchester, UK
Source
COMPUTATIONAL SCIENCE - ICCS 2020, PT I | 2020 / Vol. 12137
Keywords
Exascale; FFT; Scalable algorithm; GPUs
DOI
10.1007/978-3-030-50371-0_19
Chinese Library Classification
TP39 [Computer Applications]
Discipline Codes
081203; 0835
Abstract
Exascale computing aspires to meet the increasing demands of large scientific applications. Software targeting exascale is typically designed for heterogeneous architectures; hence, it is important not only to develop well-designed software, but also to make it aware of the underlying hardware and to exploit its power efficiently. Currently, many diverse applications, such as those within the Exascale Computing Project (ECP) in the United States, rely on efficient computation of the Fast Fourier Transform (FFT). In this context, we present the design and implementation of the heFFTe (Highly Efficient FFT for Exascale) library, which targets the upcoming exascale supercomputers. We provide highly (linearly) scalable GPU kernels that achieve more than 40x speedup over the corresponding local kernels of state-of-the-art CPU libraries, and over 2x speedup for the whole FFT computation. A communication model for parallel FFTs is also provided to analyze the bottleneck for large-scale problems. We present experiments on the Summit supercomputer at Oak Ridge National Laboratory, using up to 24,576 IBM POWER9 cores and 6,144 NVIDIA V100 GPUs.
Pages: 262-275
Page count: 14
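As a concrete illustration of the workflow the abstract describes, the sketch below shows how a distributed 3-D FFT might be invoked through heFFTe's public C++ interface. This is a minimal sketch based on the library's documentation, not code from the paper: the 128^3 grid, the two-rank brick decomposition, and the choice of the FFTW backend are illustrative assumptions; substituting heffte::backend::cufft would run the local kernels on NVIDIA GPUs, as in the Summit experiments.

```cpp
// Minimal sketch of a distributed 3-D FFT with heFFTe (illustrative sizes).
#include <mpi.h>
#include <vector>
#include <complex>
#include "heffte.h"

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Illustrative layout: a 128x128x128 global grid split along z
    // across exactly 2 MPI ranks (run with mpirun -np 2). Each rank
    // describes the sub-box (inclusive low/high indices) it owns.
    heffte::box3d<> inbox({0, 0, 64 * rank}, {127, 127, 64 * rank + 63});
    heffte::box3d<> outbox = inbox;  // same brick layout for input and output

    // Plans the local FFTs (FFTW backend here) and the MPI reshape phases.
    heffte::fft3d<heffte::backend::fftw> fft(inbox, outbox, MPI_COMM_WORLD);

    std::vector<std::complex<double>> input(fft.size_inbox(),
                                            std::complex<double>(1.0, 0.0));
    std::vector<std::complex<double>> output(fft.size_outbox());

    fft.forward(input.data(), output.data());   // forward 3-D transform
    fft.backward(output.data(), input.data(),   // inverse transform with
                 heffte::scale::full);          // full 1/N^3 rescaling

    MPI_Finalize();
    return 0;
}
```

Between the local 1-D/2-D transforms, the fft3d object performs MPI reshapes (transpositions) of the distributed data; these all-to-all exchanges are the large-scale bottleneck that the paper's communication model analyzes.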