Optimizing tensor contraction expressions for hybrid CPU-GPU execution

Cited by: 29
Authors
Ma, Wenjing [1]
Krishnamoorthy, Sriram [1]
Villa, Oreste [1]
Kowalski, Karol [1, 3]
Agrawal, Gagan [2]
Affiliations
[1] Pacific NW Natl Lab, Computat Sci & Math Div, Richland, WA 99352 USA
[2] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[3] Pacific NW Natl Lab, Environm Mol Sci Lab, Richland, WA 99352 USA
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2013 / Vol. 16 / No. 1
Keywords
Hybrid CPU plus GPU execution; CUDA; Tensor Contraction Expressions; GRAPHICS PROCESSING UNITS; COUPLED-CLUSTER THEORY; QUANTUM-CHEMISTRY; PERFORMANCE; PARALLELISM; PROGRAMS; SYSTEMS; ENGINE; MODEL
DOI
10.1007/s10586-011-0179-2
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Tensor contractions are generalized multidimensional matrix multiplication operations that widely occur in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension-sizes reducing thread block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generate CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate speedup over a factor of 8.4 using one GPU as compared to one CPU core and over 2.6 when utilizing the entire system using hybrid CPU+GPU solution with 2 GPUs and 5 cores (instead of 7 cores per node). We further investigate tensor contraction code on a new series of GPUs, the Fermi GPUs, and provide several effective optimization algorithms. For the same computation of CCSD(T), on a cluster with Fermi GPUs, we achieve a speedup of 3.4 over a cluster with T10 GPUs. With a single Fermi GPU on each node, we achieve a speedup of 43 over the sequential CPU version.
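The abstract describes tensor contractions as generalized multidimensional matrix multiplications and names index permutation as one of the challenges. The sketch below illustrates that relationship on a small, hypothetical contraction, C[a][b][c] = sum over k of A[a][k][c] * B[k][b], using plain Python. It is not the CUDA code the paper generates. The function names, index layout, and sizes are assumptions chosen only for illustration.

```python
# Illustrative sketch only: names, shapes, and index order are hypothetical,
# not taken from the paper or from NWChem.
from itertools import product

def contract_direct(A, B, na, nb, nc, nk):
    """C[a][b][c] = sum_k A[a][k][c] * B[k][b], computed with explicit loops."""
    C = [[[0.0] * nc for _ in range(nb)] for _ in range(na)]
    for a, b, c, k in product(range(na), range(nb), range(nc), range(nk)):
        C[a][b][c] += A[a][k][c] * B[k][b]
    return C

def matmul(X, Y):
    """Plain dense matrix product of nested lists (rows of X times columns of Y)."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]

def contract_via_matmul(A, B, na, nb, nc, nk):
    """Same contraction as: permute, flatten, matrix multiply, permute back."""
    # Permute A from (a, k, c) to (a, c, k) and flatten (a, c) into one row index.
    A2 = [[A[a][k][c] for k in range(nk)] for a in range(na) for c in range(nc)]
    M = matmul(A2, B)  # shape (na * nc) x nb
    # Permute the result from (a, c, b) back to the output layout (a, b, c).
    return [[[M[a * nc + c][b] for c in range(nc)] for b in range(nb)]
            for a in range(na)]

na, nb, nc, nk = 2, 3, 2, 4
# Integer-valued floats keep the comparison below exact.
A = [[[float(a + 2 * k - c) for c in range(nc)] for k in range(nk)] for a in range(na)]
B = [[float(k * b + 1) for b in range(nb)] for k in range(nk)]

C_direct = contract_direct(A, B, na, nb, nc, nk)
C_matmul = contract_via_matmul(A, B, na, nb, nc, nk)
assert C_direct == C_matmul
```

The two permutation steps around the matrix product are the kind of index permutation overhead the abstract refers to. With small dimension sizes, the flattened matrices are also small, which is consistent with the thread block utilization concern raised there.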
Pages: 131-155
Page count: 25
Related papers
44 items in total
[1]  
[Anonymous], 2010, NVIDIA CUDA Programming Guide
[2]  
[Anonymous], 2009, 320412 INTEL
[3]  
Anzt, Hartwig, 2010, GPU Accelerated Scientific Computing
[4]  
Apra E., 2009, P ACM IEEE SC C HIGH, P1, DOI 10.1145/1654059.1654127
[5]   Automatic code generation for many-body electronic structure methods: the tensor contraction engine [J].
Auer, AA ;
Baumgartner, G ;
Bernholdt, DE ;
Bibireata, A ;
Choppella, V ;
Cociorva, D ;
Gao, XY ;
Harrison, R ;
Krishnamoorthy, S ;
Krishnan, S ;
Lam, CC ;
Lu, QD ;
Nooijen, M ;
Pitzer, R ;
Ramanujam, J ;
Sadayappan, P ;
Sibiryakov, A .
MOLECULAR PHYSICS, 2006, 104 (02) :211-228
[6]   An Adaptive Performance Modeling Tool for GPU Architectures [J].
Baghsorkhi, Sara S. ;
Delahaye, Matthieu ;
Patel, Sanjay J. ;
Gropp, William D. ;
Hwu, Wen-mei W. .
ACM SIGPLAN NOTICES, 2010, 45 (05) :105-114
[7]   Coupled-cluster theory in quantum chemistry [J].
Bartlett, Rodney J. ;
Musial, Monika .
REVIEWS OF MODERN PHYSICS, 2007, 79 (01) :291-352
[8]  
Baskaran MM, 2008, ICS'08: PROCEEDINGS OF THE 2008 ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, P225
[9]   Synthesis of high-performance parallel programs for a class of Ab Initio quantum chemistry models [J].
Baumgartner, G ;
Auer, AA ;
Bernholdt, DE ;
Bibireata, A ;
Choppella, V ;
Cociorva, D ;
Gao, XY ;
Harrison, RJ ;
Hirata, S ;
Krishnamoorthy, S ;
Krishnan, S ;
Lam, CC ;
Lu, QD ;
Nooijen, M ;
Pitzer, RM ;
Ramanujam, J ;
Sadayappan, P ;
Sibiryakov, A .
PROCEEDINGS OF THE IEEE, 2005, 93 (02) :276-292
[10]  
BOYER M, 2009, PAR DISTR PROC 2009, P1