High Performance Graph Data Imputation on Multiple GPUs

Cited by: 1
Authors
Zhou, Chao [1]
Zhang, Tao [1]
Affiliations
[1] Shanghai Univ, Sch Comp Engn & Sci, Shanghai 200444, Peoples R China
Source
FUTURE INTERNET | 2021, Vol. 13, No. 2
Keywords
GPU; data imputation; graph-tensor; library;
DOI
10.3390/fi13020036
Chinese Library Classification
TP [Automation Technology; Computer Technology]
Discipline Classification Code
0812
Abstract
In real applications, massive data with graph structures are often incomplete due to various restrictions. Graph data imputation algorithms have therefore been widely used in social networks, sensor networks, and MRI to solve the graph data completion problem. To preserve the relationships among the data, the data are represented by a graph-tensor, in which each matrix is the value of a vertex of a weighted graph. The convolutional imputation algorithm has been proposed to solve the low-rank graph-tensor completion problem in which some data matrices are entirely unobserved. However, this imputation algorithm has limited applicability because it is compute-intensive and performs poorly on CPUs. In this paper, we propose a scheme to run the convolutional imputation algorithm with higher performance on GPUs (Graphics Processing Units) by exploiting the many-core CUDA architecture. We propose optimization strategies that achieve coalesced memory access in the graph Fourier transform (GFT) computation and improve the utilization of GPU streaming-multiprocessor (SM) resources in the singular value decomposition (SVD) computation. Furthermore, we design a scheme that extends the GPU-optimized implementation to multiple GPUs for large-scale computing. Experimental results show that the GPU implementation is both fast and accurate. On synthetic data of varying sizes, the GPU-optimized implementation running on a single Quadro RTX 6000 GPU achieves up to 60.50x speedups over the GPU-baseline implementation. The multi-GPU implementation achieves up to 1.81x speedups on two GPUs over the GPU-optimized implementation on a single GPU. On the ego-Facebook dataset, the GPU-optimized implementation achieves up to 77.88x speedups over the GPU-baseline implementation. Meanwhile, the GPU implementation and the CPU implementation achieve similarly low recovery errors.
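To make the abstract's pipeline concrete, the following is a minimal, hypothetical sketch of one shrinkage iteration of transform-domain low-rank imputation on a graph-tensor: a GFT along the graph axis (here taken as the eigenbasis of the graph Laplacian), a per-frequency SVD with singular-value soft-thresholding, and an inverse GFT. All names, shapes, and the threshold parameter `tau` are illustrative assumptions, not the paper's implementation, which runs these steps on GPUs via CUDA.

```python
# Hypothetical sketch of one iteration of transform-domain low-rank
# imputation on a graph-tensor; not the paper's GPU implementation.
import numpy as np

def gft_matrix(L):
    """GFT basis: eigenvectors of the symmetric graph Laplacian L (n x n)."""
    _, U = np.linalg.eigh(L)
    return U

def impute_step(T, L, tau):
    """One shrinkage step on a graph-tensor T of shape (n, p, q):
    GFT along the graph axis, per-frequency SVD soft-thresholding
    with parameter tau, then inverse GFT back to the vertex domain."""
    U = gft_matrix(L)
    # Forward GFT: mix the n vertex matrices with eigenvector weights.
    That = np.einsum('ij,jpq->ipq', U.T, T)
    for i in range(That.shape[0]):
        W, s, Vt = np.linalg.svd(That[i], full_matrices=False)
        That[i] = W @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
    # Inverse GFT.
    return np.einsum('ij,jpq->ipq', U, That)

# Toy example: path graph on 4 vertices, each vertex holding a 3x3 matrix.
A = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
Lap = np.diag(A.sum(axis=1)) - A
T = np.random.default_rng(0).standard_normal((4, 3, 3))
T_new = impute_step(T, Lap, tau=0.5)
print(T_new.shape)  # (4, 3, 3)
```

In a full completion loop, observed entries would be re-imposed after each shrinkage step; the paper's GPU scheme batches the per-frequency SVDs and fuses the GFT into coalesced memory accesses instead of looping as above.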
Pages: 1-17 (17 pages)