Fast multipole methods on graphics processors

被引:130
作者
Gumerov, Nail A. [1 ]
Duraiswami, Ramani
机构
[1] Univ Maryland, Perceptual Interfaces & Real Lab, Comp Sci & UMIACS, College Pk, MD 20742 USA
关键词
D O I
10.1016/j.jcp.2008.05.023
中图分类号
TP39 [计算机的应用];
学科分类号
081203 ; 0835 ;
摘要
The fast multipole method allows the rapid approximate evaluation of sums of radial basis functions. For a specified accuracy, epsilon, the method scales as O(N) in both time and memory compared to the direct method with complexity O(N-2), which allows the solution of larger problems with given resources. Graphical processing units (GPU) are now increasingly viewed as data parallel compute coprocessors that can provide significant computational performance at low price. We describe acceleration of the FMM using the data parallel GPU architecture. The FMM has a complex hierarchical (adaptive) structure, which is not easily implemented on data-parallel processors. We described strategies for parallelization of all components of the FMM, develop a model to explain the performance of the algorithm on the GPU architecture; and determined optimal settings for the FMM on the GPU. These optimal settings are different from those on usual CPUs. Some innovations in the FMM algorithm, including the use of modified stencils, real polynomial basis functions for the Laplace kernel, and decompositions of the translation operators, are also described. We obtained accelerations of the Laplace kernel FMM on a single NVIDIA GeForce 8800 GTX GPU in the range of 30-60 compared to a serial CPU FMM implementation. For a problem with a million sources, the summations involved are performed in approximately one second. This performance is equivalent to solving of the same problem at a 43 Teraflop rate if we use straightforward summation. (c) 2008 Elsevier Inc. All rights reserved.
引用
收藏
页码:8290 / 8313
页数:24
相关论文
共 27 条
[1]  
[Anonymous], 1964, Handbook of mathematical functions
[2]  
[Anonymous], NVIDIA CUDA COMP UN
[3]   A fast adaptive multipole algorithm in three dimensions [J].
Cheng, H ;
Greengard, L ;
Rokhlin, V .
JOURNAL OF COMPUTATIONAL PHYSICS, 1999, 155 (02) :468-498
[4]   The top 10 algorithms [J].
Dongarra, J ;
Sullivan, F .
COMPUTING IN SCIENCE & ENGINEERING, 2000, 2 (01) :22-23
[5]   MULTIPOLE TRANSLATION THEORY FOR THE 3-DIMENSIONAL LAPLACE AND HELMHOLTZ EQUATIONS [J].
EPTON, MA ;
DEMBART, B .
SIAM JOURNAL ON SCIENTIFIC COMPUTING, 1995, 16 (04) :865-897
[6]   GRAPE-6A: A single-card GRAPE-6 for parallel PC-GRAPE cluster systems [J].
Fukushige, T ;
Makino, J ;
Kawai, A .
PUBLICATIONS OF THE ASTRONOMICAL SOCIETY OF JAPAN, 2005, 57 (06) :1009-1021
[7]  
GOEDDEKE D, 2005, FRONTIERS SIMULATION, P139
[8]   A FAST ALGORITHM FOR PARTICLE SIMULATIONS [J].
GREENGARD, L ;
ROKHLIN, V .
JOURNAL OF COMPUTATIONAL PHYSICS, 1987, 73 (02) :325-348
[9]  
Greengard L., 1997, Acta Numerica, V6, P229, DOI 10.1017/S0962492900002725
[10]  
GREENGARD L, 1990, COMPUT MATH APP, V20