Adaptive Parallel Training for Graph Neural Networks

Cited by: 0
Authors
Ma, Kaihao [1 ,5 ]
Liu, Renjie [2 ,5 ]
Yan, Xiao [3 ]
Cai, Zhenkun [4 ]
Song, Xiang [4 ]
Wang, Minjie [5 ]
Li, Yichao [1 ]
Cheng, James [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Southern Univ Sci & Technol, Shenzhen, Peoples R China
[3] Ctr Perceptual & Interact Intelligence, Hong Kong, Peoples R China
[4] Amazon, Seattle, WA USA
[5] AWS Shanghai AI Lab, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM SIGPLAN ANNUAL SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, PPOPP 2025 | 2025
Keywords
Graph Neural Networks; Distributed and Parallel Training; Network Communication;
DOI
10.1145/3710848.3710883
CLC number
TP3 [Computing technology; computer technology]
Discipline code
0812
Abstract
There are several strategies for parallelizing graph neural network (GNN) training over multiple GPUs. We observe that no single strategy is a consistent winner (i.e., achieves the shortest running time); the optimal strategy depends on the graph dataset, the GNN model, the training algorithm, and the hardware configuration. As such, we design the APT system to automatically select efficient parallelization strategies for GNN training tasks. To this end, we analyze the trade-offs among the strategies and design simple yet effective cost models to compare their execution times and facilitate strategy selection. Moreover, we propose a general abstraction of the strategies, which allows us to implement a unified execution engine that can be configured to run any of them. Our experiments show that APT usually chooses the optimal or a close-to-optimal strategy, and that training time can be reduced by over 2x compared with always using a single strategy. APT is open-source at https://github.com/kaihaoma/APT.
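The abstract describes cost-model-driven strategy selection: estimate each parallelization strategy's per-epoch time with a simple analytical model, then pick the cheapest. A minimal sketch of that idea in Python follows; the strategy set, cost terms, and constants here are illustrative assumptions, not APT's actual models.

```python
# Sketch of cost-model-based parallelization strategy selection.
# Each strategy gets a crude analytical cost (arbitrary units); the
# planner returns the argmin. All cost formulas are hypothetical.
from dataclasses import dataclass


@dataclass
class Workload:
    num_nodes: int   # nodes in the training graph
    feat_dim: int    # input feature dimension
    fanout: int      # neighbor-sampling fanout per layer
    num_gpus: int    # GPUs available for training


def cost_data_parallel(w: Workload) -> float:
    # Compute is split across GPUs, but remote feature pulls dominate
    # communication (illustrative 0.5 transfer-cost coefficient).
    compute = w.num_nodes * w.fanout * w.feat_dim / w.num_gpus
    comm = w.num_nodes * w.fanout * w.feat_dim * 0.5
    return compute + comm


def cost_model_parallel(w: Workload) -> float:
    # Features stay local; smaller activation exchanges instead
    # (illustrative 0.1 transfer-cost coefficient).
    compute = w.num_nodes * w.fanout * w.feat_dim / w.num_gpus
    comm = w.num_nodes * w.feat_dim * 0.1
    return compute + comm


STRATEGIES = {
    "data-parallel": cost_data_parallel,
    "model-parallel": cost_model_parallel,
}


def select_strategy(w: Workload) -> str:
    """Return the strategy name with the smallest estimated cost."""
    return min(STRATEGIES, key=lambda name: STRATEGIES[name](w))
```

Under these toy cost models, a workload with a large sampling fanout makes remote feature pulls expensive, so `select_strategy` steers away from data parallelism; a real system would calibrate such coefficients against measured bandwidth and GPU throughput.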
Pages: 29-42
Page count: 14