High-Performance Tensor Contractions for GPUs

Cited by: 39
Authors
Abdelfattah, A. [1 ]
Baboulin, M. [2 ]
Dobrev, V. [3 ]
Dongarra, J. [1 ,4 ]
Earl, C. [3 ]
Falcou, J. [2 ]
Haidar, A. [1 ]
Karlin, I. [3 ]
Kolev, Tz. [3 ]
Masliah, I. [2 ]
Tomov, S. [1 ]
Affiliations
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
[2] Univ Paris Sud, Orsay, France
[3] Lawrence Livermore Natl Lab, Livermore, CA 94550 USA
[4] Univ Manchester, Manchester, Lancs, England
Source
INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE 2016 (ICCS 2016) | 2016 / Vol. 80
Funding
National Science Foundation (US)
Keywords
Tensor contractions; Tensor HPC; GPU; Batched linear algebra; FEM; Applications;
DOI
10.1016/j.procs.2016.05.302
CLC Number
TP301 [Theory and methods]
Discipline Code
081202
Abstract
We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain with existing libraries, especially for many independent contractions in which each contraction is very small, e.g., sub-vector/warp in size. However, by batching contractions in our framework and exploiting application-specific knowledge, we demonstrate close-to-peak performance. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which are the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving many-fold algorithmic acceleration (vs. not using it), because data loaded into fast memory can be reused. In addition to using this contextual knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations that achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8, for example, we are 2.8x faster than CUBLAS and 8.5x faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60 GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
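The abstract's central idea, representing a tensor contraction as an index reordering (permutation) followed by a single GEMM, can be illustrated with a minimal NumPy sketch. This is not the paper's GPU implementation; the tensor shapes and names here are made up for illustration, and NumPy's `@` stands in for the batched GPU GEMMs the paper actually uses.

```python
import numpy as np

# Hypothetical contraction D[a,b,c] = sum_k T[a,k,b] * M[k,c].
rng = np.random.default_rng(0)
a, b, c, k = 3, 4, 5, 6
T = rng.standard_normal((a, k, b))
M = rng.standard_normal((k, c))

# Step 1: index reordering -- permute T so the contracted index k is last.
T_perm = np.transpose(T, (0, 2, 1))               # shape (a, b, k)

# Step 2: reshape the free indices into one matrix dimension and perform
# the whole contraction as a single matrix-matrix multiplication (GEMM).
D = (T_perm.reshape(a * b, k) @ M).reshape(a, b, c)

# Check against direct summation over k.
D_ref = np.einsum('akb,kc->abc', T, M)
assert np.allclose(D, D_ref)
```

Casting contractions as GEMMs is what makes data reuse in fast memory possible: the same blocks of the reshaped operands are revisited across many output elements, which is where the many-fold acceleration the abstract describes comes from.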
Pages: 108-118 (11 pages)