Optimizing sparse tensor times matrix on GPUs

Cited by: 22
Authors
Ma, Yuchen [1]
Li, Jiajia [2]
Wu, Xiaolong [3]
Yan, Chenggang [1]
Sun, Jimeng [2]
Vuduc, Richard [2]
Affiliations
[1] Hangzhou Dianzi Univ, Inst Informat & Control, Hangzhou, Zhejiang, Peoples R China
[2] Georgia Inst Technol, Computat Sci & Engn, Atlanta, GA 30332 USA
[3] Virginia Tech, Comp Sci, Blacksburg, VA 24061 USA
Funding
US National Science Foundation;
Keywords
Sparse tensors; Irregular algorithms; Tensor decomposition; GPU; MULTIPLICATION; SPMV; DECOMPOSITIONS; IMPLEMENTATION; LIBRARY;
DOI
10.1016/j.jpdc.2018.07.018
Chinese Library Classification
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
This work optimizes tensor-times-dense-matrix multiply (TTM) for general sparse and semi-sparse tensors on CPU and NVIDIA GPU platforms. TTM is a core computational kernel in tensor-method-based data analytics and data mining applications, such as the popular Tucker decomposition. We first design an in-place sequential SpTTM that avoids the explicit data reorganization between a tensor and a matrix required by the conventional approach. We further optimize SpTTM on NVIDIA GPU platforms: five approaches, including fine thread granularity, coalesced memory access, rank blocking, and the use of fast GPU shared memory, are developed for GPU-SpTTM. We also optimize semi-sparse tensor-times-dense-matrix multiply (SspTTM) to take advantage of the dense sub-structures inside a semi-sparse tensor. The optimized SpTTM and SspTTM are then applied to Tucker decomposition to improve its overall performance. Our sequential SpTTM is 3-120x faster than the SpTTM in the Tensor Toolbox library. GPU-SpTTM obtains 6-19x speedup on an NVIDIA K40c and 23-67x speedup on an NVIDIA P100 over CPU-SpTTM, and is 3.9x faster than the state-of-the-art GPU implementation. Our SspTTM implementations outperform SpTTM, which handles a semi-sparse input tensor as a general sparse one, by up to 4.5x. Tucker decomposition achieves up to 3.2x speedup after applying the optimized TTMs. The code will be publicly released in the ParTI! library: https://github.com/hpcgarage/ParTI.
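To make the kernel concrete, below is a minimal sequential C sketch of a COO-based SpTTM of the kind the abstract describes, for the mode-2 product of a third-order tensor: Y(i, :, k) += X(i, j, k) * U(j, :). The function name, argument layout, and the precomputed nonzero-to-output-fiber map are illustrative assumptions for this sketch, not the ParTI! API.

    /* Hypothetical sketch of sequential SpTTM (Y = X x_2 U) for a
     * third-order COO tensor; not the ParTI! implementation. */
    #include <stddef.h>

    /* nnz nonzeros with COO indices (i, j, k) and values val.
     * U: dense J x R matrix, row-major.
     * Y: dense output fibers of length R, one per distinct (i, k) pair;
     * fiber[z] gives the output fiber that nonzero z accumulates into
     * (assumed precomputed from the sorted (i, k) indices). */
    void spttm_mode2(size_t nnz, const size_t *i, const size_t *j,
                     const size_t *k, const double *val,
                     const double *U, size_t R,
                     const size_t *fiber, double *Y)
    {
        (void)i; (void)k;  /* (i, k) are already folded into fiber[] */
        for (size_t z = 0; z < nnz; ++z) {
            const double  x = val[z];
            const double *u = U + j[z] * R;      /* row j of U        */
            double       *y = Y + fiber[z] * R;  /* output fiber of z */
            for (size_t r = 0; r < R; ++r)
                y[r] += x * u[r];  /* Y(i, :, k) += x * U(j, :) */
        }
    }

On a GPU, a natural parallelization assigns consecutive values of r to adjacent threads so that reads of U and updates of Y are coalesced, which roughly corresponds to the fine-thread-granularity and coalesced-memory-access optimizations the abstract lists; rank blocking then tiles the r loop when R exceeds the thread-block width.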
Pages: 99-109
Number of pages: 11