TurboMGNN: Improving Concurrent GNN Training Tasks on GPU With Fine-Grained Kernel Fusion

Cited by: 6
Authors
Wu, Wenchao [1 ]
Shi, Xuanhua [1 ]
He, Ligang [2 ]
Jin, Hai [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Natl Engn Res Ctr Big Data Technol & Syst, Sch Comp Sci & Technol, Serv Comp Technol, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China
[2] Univ Warwick, Dept Comp Sci, Coventry CV4 7AL, England
Funding
National Key R&D Program of China
Keywords
Task analysis; Training; Graphics processing units; Kernel; Graph neural networks; Computational modeling; Fuses; GNN training; concurrent multi-tasks; GPU; kernel fusion
DOI
10.1109/TPDS.2023.3267943
CLC Number
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
Graph Neural Networks (GNNs) have evolved into powerful models for graph representation learning. Many works support efficient GNN training on GPUs through techniques such as operator optimization, task scheduling, and new programming models. However, these works focus only on a single GNN training task; concurrent GNN training, which is needed in applications such as neural architecture search, has not yet been explored. This work aims to improve the training efficiency of concurrent GNN training tasks on a GPU by developing fine-grained methods to fuse kernels from different tasks. Specifically, we propose a fine-grained Sparse Matrix Multiplication (SpMM)-based kernel fusion method that eliminates redundant accesses to graph data. To increase fusion opportunities and reduce synchronization cost, we further propose a novel technique that enables the fusion of kernels across forward and backward propagation. Finally, to reduce the resource contention caused by the increased number of concurrent, heterogeneous GNN training tasks, we propose an adaptive strategy that groups tasks and matches their operators according to their resource contention. We have conducted extensive experiments, including kernel- and model-level benchmarks. The results show that the proposed methods achieve up to a 2.6x speedup.
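To make the SpMM-based fusion idea concrete, the CUDA kernel below is a minimal sketch (not the paper's implementation; the kernel name fused_spmm and all variable names are hypothetical). It aggregates neighbor features for two concurrent training tasks that share the same graph in CSR format, traversing the sparse structure in a single pass instead of once per task:

#include <cuda_runtime.h>

// Hypothetical fused SpMM kernel: one thread block per graph row, threads
// strided across the concatenated output columns of both tasks.
__global__ void fused_spmm(const int *rowptr, const int *colidx,
                           const float *vals,
                           const float *feat_a, float *out_a, int fa,
                           const float *feat_b, float *out_b, int fb,
                           int num_rows)
{
    int row = blockIdx.x;
    if (row >= num_rows) return;

    int start = rowptr[row], end = rowptr[row + 1];
    for (int c = threadIdx.x; c < fa + fb; c += blockDim.x) {
        bool in_a = (c < fa);
        const float *feat = in_a ? feat_a : feat_b;
        int width = in_a ? fa : fb;
        int col   = in_a ? c : c - fa;

        // The CSR indices (the shared graph data) are streamed for this row
        // and reused for the feature matrices of both tasks, instead of being
        // re-read by two separate per-task SpMM launches.
        float acc = 0.0f;
        for (int e = start; e < end; ++e)
            acc += vals[e] * feat[colidx[e] * width + col];

        if (in_a) out_a[row * fa + col] = acc;
        else      out_b[row * fb + col] = acc;
    }
}

// Launch example (assumed row-major N x F feature layout, device pointers):
//   fused_spmm<<<num_rows, 128>>>(rowptr, colidx, vals,
//                                 feat_a, out_a, 64,
//                                 feat_b, out_b, 32, num_rows);

In an unfused baseline, each task launches its own SpMM, so rowptr, colidx, and vals are read from global memory once per task; the fused launch amortizes those reads across both tasks' feature matrices, which is the kind of redundant graph-data access the paper's fusion method targets.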
Pages: 1968 - 1981
Number of pages: 14