TurboMGNN: Improving Concurrent GNN Training Tasks on GPU With Fine-Grained Kernel Fusion

Cited by: 6
Authors
Wu, Wenchao [1 ]
Shi, Xuanhua [1 ]
He, Ligang [2 ]
Jin, Hai [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Natl Engn Res Ctr Big Data Technol & Syst, Sch Comp Sci & Technol, Serv Comp Technol, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China
[2] Univ Warwick, Dept Comp Sci, Coventry CV4 7AL, England
Funding
National Key R&D Program of China
Keywords
Task analysis; Training; Graphics processing units; Kernel; Graph neural networks; Computational modeling; Fuses; GNN training; concurrent multi-tasks; GPU; kernel fusion
DOI
10.1109/TPDS.2023.3267943
CLC Number
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
Graph Neural Networks (GNNs) have evolved into powerful models for graph representation learning. Many works support efficient GNN training on GPUs through techniques such as operator optimization, task scheduling, and new programming models. However, these works focus only on a single GNN training task; concurrent GNN training, which is needed in applications such as neural architecture search, has not yet been explored. This work aims to improve the training efficiency of concurrent GNN training tasks on a GPU by developing fine-grained methods to fuse kernels from different tasks. Specifically, we propose a fine-grained Sparse Matrix Multiplication (SpMM)-based kernel fusion method that eliminates redundant accesses to graph data. To increase fusion opportunities and reduce synchronization cost, we further propose a novel technique that enables the fusion of kernels across forward and backward propagation. Finally, to reduce the resource contention caused by the increased number of concurrent, heterogeneous GNN training tasks, we propose an adaptive strategy that groups tasks and matches their operators according to their resource contention. We have conducted extensive experiments, including kernel- and model-level benchmarks. The results show that the proposed methods achieve up to a 2.6x speedup.
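To make the SpMM-based fusion idea concrete, the CUDA kernel below is a minimal sketch (not the paper's implementation; the kernel name fused_spmm and all variable names are hypothetical). It aggregates neighbor features for two concurrent training tasks that share the same graph in CSR format, traversing the sparse structure in a single pass instead of once per task:

#include <cuda_runtime.h>

// Hypothetical fused SpMM kernel: one thread block per graph row, threads
// strided across the concatenated output columns of both tasks.
__global__ void fused_spmm(const int *rowptr, const int *colidx,
                           const float *vals,
                           const float *feat_a, float *out_a, int fa,
                           const float *feat_b, float *out_b, int fb,
                           int num_rows)
{
    int row = blockIdx.x;
    if (row >= num_rows) return;

    int start = rowptr[row], end = rowptr[row + 1];
    for (int c = threadIdx.x; c < fa + fb; c += blockDim.x) {
        bool in_a = (c < fa);
        const float *feat = in_a ? feat_a : feat_b;
        int width = in_a ? fa : fb;
        int col   = in_a ? c : c - fa;

        // The CSR indices (the shared graph data) are streamed for this row
        // and reused for the feature matrices of both tasks, instead of being
        // re-read by two separate per-task SpMM launches.
        float acc = 0.0f;
        for (int e = start; e < end; ++e)
            acc += vals[e] * feat[colidx[e] * width + col];

        if (in_a) out_a[row * fa + col] = acc;
        else      out_b[row * fb + col] = acc;
    }
}

// Launch example (assumed row-major N x F feature layout, device pointers):
//   fused_spmm<<<num_rows, 128>>>(rowptr, colidx, vals,
//                                 feat_a, out_a, 64,
//                                 feat_b, out_b, 32, num_rows);

In an unfused baseline, each task launches its own SpMM, so rowptr, colidx, and vals are read from global memory once per task; the fused launch amortizes those reads across both tasks' feature matrices, which is the kind of redundant graph-data access the paper's fusion method targets.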
Pages: 1968 - 1981
Number of pages: 14