Adaptive Parallel Training for Graph Neural Networks

Cited by: 0
Authors
Ma, Kaihao [1 ,5 ]
Liu, Renjie [2 ,5 ]
Yan, Xiao [3 ]
Cai, Zhenkun [4 ]
Song, Xiang [4 ]
Wang, Minjie [5 ]
Li, Yichao [1 ]
Cheng, James [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Southern Univ Sci & Technol, Shenzhen, Peoples R China
[3] Ctr Perceptual & Interact Intelligence, Hong Kong, Peoples R China
[4] Amazon, Seattle, WA USA
[5] AWS Shanghai AI Lab, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM SIGPLAN ANNUAL SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, PPOPP 2025 | 2025
Keywords
Graph Neural Networks; Distributed and Parallel Training; Network Communication;
DOI
10.1145/3710848.3710883
CLC number
TP3 [Computing technology; computer technology]
Discipline code
0812
Abstract
There are several strategies for parallelizing graph neural network (GNN) training over multiple GPUs. We observe that no single strategy is a consistent winner (i.e., achieves the shortest running time); the optimal strategy depends on the graph dataset, the GNN model, the training algorithm, and the hardware configuration. As such, we design the APT system to automatically select efficient parallelization strategies for GNN training tasks. To this end, we analyze the trade-offs among the strategies and design simple yet effective cost models to compare their execution times and facilitate strategy selection. Moreover, we propose a general abstraction of the strategies, which allows us to implement a unified execution engine that can be configured to run any of them. Our experiments show that APT usually chooses the optimal or a close-to-optimal strategy, and that training time can be reduced by over 2x compared with always using a single strategy. APT is open-source at https://github.com/kaihaoma/APT.
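The abstract describes cost-model-driven strategy selection: estimate each parallelization strategy's per-epoch time with a simple analytical model, then pick the cheapest. A minimal sketch of that idea in Python follows; the strategy set, cost terms, and constants here are illustrative assumptions, not APT's actual models.

```python
# Sketch of cost-model-based parallelization strategy selection.
# Each strategy gets a crude analytical cost (arbitrary units); the
# planner returns the argmin. All cost formulas are hypothetical.
from dataclasses import dataclass


@dataclass
class Workload:
    num_nodes: int   # nodes in the training graph
    feat_dim: int    # input feature dimension
    fanout: int      # neighbor-sampling fanout per layer
    num_gpus: int    # GPUs available for training


def cost_data_parallel(w: Workload) -> float:
    # Compute is split across GPUs, but remote feature pulls dominate
    # communication (illustrative 0.5 transfer-cost coefficient).
    compute = w.num_nodes * w.fanout * w.feat_dim / w.num_gpus
    comm = w.num_nodes * w.fanout * w.feat_dim * 0.5
    return compute + comm


def cost_model_parallel(w: Workload) -> float:
    # Features stay local; smaller activation exchanges instead
    # (illustrative 0.1 transfer-cost coefficient).
    compute = w.num_nodes * w.fanout * w.feat_dim / w.num_gpus
    comm = w.num_nodes * w.feat_dim * 0.1
    return compute + comm


STRATEGIES = {
    "data-parallel": cost_data_parallel,
    "model-parallel": cost_model_parallel,
}


def select_strategy(w: Workload) -> str:
    """Return the strategy name with the smallest estimated cost."""
    return min(STRATEGIES, key=lambda name: STRATEGIES[name](w))
```

Under these toy cost models, a workload with a large sampling fanout makes remote feature pulls expensive, so `select_strategy` steers away from data parallelism; a real system would calibrate such coefficients against measured bandwidth and GPU throughput.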
Pages: 29-42
Page count: 14