Distributed Hybrid CPU and GPU training for Graph Neural Networks on Billion-Scale Heterogeneous Graphs

Cited by: 21
Authors
Zheng, Da [1 ]
Song, Xiang [1 ]
Yang, Chengru [2 ]
LaSalle, Dominique [3 ]
Karypis, George [1 ]
Affiliations
[1] Amazon, Seattle, WA 98109 USA
[2] Amazon, Beijing, People's Republic of China
[3] NVIDIA Corp, Boston, MA USA
Source
Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2022), 2022
Keywords
graph neural networks; distributed training;
DOI
10.1145/3534678.3539177
Chinese Library Classification
TP [Automation Technology; Computer Technology]
Discipline Classification Code
0812
Abstract
Graph neural networks (GNNs) have shown great success in learning from graph-structured data. They are widely used in applications such as recommendation, fraud detection, and search. In these domains, the graphs are typically large and heterogeneous, containing many millions or billions of vertices and edges of different types. To tackle this challenge, we develop DistDGLv2, a system that extends DistDGL for training GNNs on massive heterogeneous graphs in a mini-batch fashion using distributed hybrid CPU/GPU training. DistDGLv2 places graph data in distributed CPU memory and performs mini-batch computation on GPUs. For ease of use, DistDGLv2 adopts an API compatible with Deep Graph Library (DGL)'s mini-batch training and heterogeneous graph APIs, which enables distributed training with almost no code modification. To ensure model accuracy, DistDGLv2 follows a synchronous training approach and allows the ego-networks that form mini-batches to include non-local vertices. To ensure data locality and load balancing, DistDGLv2 partitions heterogeneous graphs using a multi-level partitioning algorithm with a min-edge-cut objective and multiple balancing constraints. DistDGLv2 deploys an asynchronous mini-batch generation pipeline that overlaps computation and data access to fully utilize all hardware (CPU, GPU, network, PCIe). We demonstrate DistDGLv2 on various GNN workloads. Our results show that DistDGLv2 achieves a 2-3x speedup over DistDGL and an 18x speedup over Euler. It takes only 5-10 seconds to complete an epoch on graphs with hundreds of millions of vertices on a cluster with 64 GPUs.
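To make the DGL-compatible workflow described in the abstract concrete, the following is a minimal sketch of a distributed mini-batch training loop written against DGL's public distributed API. The graph name ('my_graph'), the ip_config.txt file, the fan-outs, and the hidden/output sizes are illustrative assumptions rather than values from the paper, and exact class names may vary across DGL versions; this is not the authors' implementation.

# Sketch: DGL-style distributed hybrid CPU/GPU mini-batch training.
# Assumes the graph was partitioned offline and ip_config.txt lists the cluster machines.
import dgl
import dgl.nn as dglnn
import torch as th
import torch.nn as nn
import torch.nn.functional as F

class SAGE(nn.Module):
    def __init__(self, in_feats, hid_feats, n_classes):
        super().__init__()
        self.conv1 = dglnn.SAGEConv(in_feats, hid_feats, 'mean')
        self.conv2 = dglnn.SAGEConv(hid_feats, n_classes, 'mean')

    def forward(self, blocks, x):
        h = F.relu(self.conv1(blocks[0], x))
        return self.conv2(blocks[1], h)

dgl.distributed.initialize('ip_config.txt')          # connect to the distributed graph servers
th.distributed.init_process_group(backend='gloo')    # synchronous gradient aggregation

g = dgl.distributed.DistGraph('my_graph')            # graph data resides in distributed CPU memory
train_nids = dgl.distributed.node_split(g.ndata['train_mask'], g.get_partition_book())

# Each mini-batch is a 2-hop ego-network sampled around a batch of training vertices,
# which may include non-local (remote) vertices.
sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10])
dataloader = dgl.dataloading.DistNodeDataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True)

model = SAGE(g.ndata['feat'].shape[1], 256, 40).to('cuda')   # sizes are placeholders
model = nn.parallel.DistributedDataParallel(model)
opt = th.optim.Adam(model.parameters(), lr=1e-3)

for input_nodes, output_nodes, blocks in dataloader:
    blocks = [b.to('cuda') for b in blocks]                  # mini-batch computation on the GPU
    x = g.ndata['feat'][input_nodes].to('cuda')              # features fetched from distributed CPU memory
    y = g.ndata['label'][output_nodes].to('cuda').long()
    loss = F.cross_entropy(model(blocks, x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

The offline partitioning step assumed above corresponds to DGL's METIS-based partitioner, e.g. dgl.distributed.partition_graph(g, 'my_graph', num_parts, out_path, balance_ntypes=..., balance_edges=True), whose min-edge-cut objective and balancing constraints match the partitioning strategy described in the abstract; the keyword names here follow DGL's documented API, not the paper.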
Pages: 4582-4591
Page count: 10