High-Performance Tensor Contractions for GPUs

Cited by: 39
Authors
Abdelfattah, A. [1 ]
Baboulin, M. [2 ]
Dobrev, V. [3 ]
Dongarra, J. [1 ,4 ]
Earl, C. [3 ]
Falcou, J. [2 ]
Haidar, A. [1 ]
Karlin, I. [3 ]
Kolev, Tz. [3 ]
Masliah, I. [2 ]
Tomov, S. [1 ]
Affiliations
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
[2] Univ Paris Sud, Orsay, France
[3] Lawrence Livermore Natl Lab, Livermore, CA 94550 USA
[4] Univ Manchester, Manchester, Lancs, England
Source
INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE 2016 (ICCS 2016) | 2016 / Vol. 80
Funding
National Science Foundation (US)
Keywords
Tensor contractions; Tensor HPC; GPU; Batched linear algebra; FEM; Applications;
DOI
10.1016/j.procs.2016.05.302
CLC Number
TP301 [Theory and methods]
Discipline Code
081202
Abstract
We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain with existing libraries, especially for many independent contractions in which each contraction is very small, e.g., sub-vector/warp in size. However, by batching contractions in our framework and exploiting application-specific knowledge, we demonstrate close-to-peak performance. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which are the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving many-fold algorithmic acceleration (vs. not using it), because data loaded into fast memory can be reused. In addition to using this contextual knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations that achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8, for example, we are 2.8x faster than CUBLAS and 8.5x faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60 GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
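The abstract's central idea, representing a tensor contraction as an index reordering (permutation) followed by a single GEMM, can be illustrated with a minimal NumPy sketch. This is not the paper's GPU implementation; the tensor shapes and names here are made up for illustration, and NumPy's `@` stands in for the batched GPU GEMMs the paper actually uses.

```python
import numpy as np

# Hypothetical contraction D[a,b,c] = sum_k T[a,k,b] * M[k,c].
rng = np.random.default_rng(0)
a, b, c, k = 3, 4, 5, 6
T = rng.standard_normal((a, k, b))
M = rng.standard_normal((k, c))

# Step 1: index reordering -- permute T so the contracted index k is last.
T_perm = np.transpose(T, (0, 2, 1))               # shape (a, b, k)

# Step 2: reshape the free indices into one matrix dimension and perform
# the whole contraction as a single matrix-matrix multiplication (GEMM).
D = (T_perm.reshape(a * b, k) @ M).reshape(a, b, c)

# Check against direct summation over k.
D_ref = np.einsum('akb,kc->abc', T, M)
assert np.allclose(D, D_ref)
```

Casting contractions as GEMMs is what makes data reuse in fast memory possible: the same blocks of the reshaped operands are revisited across many output elements, which is where the many-fold acceleration the abstract describes comes from.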
Pages: 108-118 (11 pages)