Multi-Node Acceleration for Large-Scale GCNs

Cited by: 4
Authors
Sun, Gongjian [1 ,2 ]
Yan, Mingyu [1 ,2 ]
Wang, Duo [1 ,2 ]
Li, Han [1 ,2 ]
Li, Wenming [1 ,2 ]
Ye, Xiaochun [1 ,2 ]
Fan, Dongrui [1 ,2 ]
Xie, Yuan [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing 100045, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 101408, Peoples R China
[3] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; graph neural network; hardware accelerator; multi-node system; communication optimization;
DOI
10.1109/TC.2022.3207127
CLC classification
TP3 [Computing technology, computer technology];
Subject classification
0812;
Abstract
Limited by memory capacity and compute power, single-node graph convolutional neural network (GCN) accelerators cannot complete the execution of GCNs within a reasonable amount of time, given the explosive growth in graph sizes. Large-scale GCNs therefore call for a multi-node acceleration system (MultiAccSys), analogous to the tensor processing unit (TPU) Pod for large-scale neural networks. In this work, we aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale graphs. We first identify the communication patterns and challenges of multi-node acceleration for GCNs on large-scale graphs. We observe that (1) irregular coarse-grained communication patterns exist in the execution of GCNs in a MultiAccSys, introducing a massive amount of redundant network transmissions and off-chip memory accesses; and (2) the acceleration of GCNs in a MultiAccSys is mainly bounded by network bandwidth but tolerates network latency. Guided by these observations, we propose MultiGCN, an efficient MultiAccSys for large-scale GCNs that trades network latency for network bandwidth. Specifically, leveraging the network latency tolerance, we first propose a topology-aware multicast mechanism with a one-put-per-multicast message-passing model to reduce transmissions and alleviate network bandwidth requirements. Second, we introduce a scatter-based round execution mechanism that cooperates with the multicast mechanism and reduces redundant off-chip memory accesses. Compared to the baseline MultiAccSys, MultiGCN achieves a 4~12x speedup using only 28%~68% of the energy, while reducing transmissions by 32% and off-chip memory accesses by 73% on average. Moreover, MultiGCN not only achieves a 2.5~8x speedup over the state-of-the-art multi-GPU solution, but also scales to large-scale graphs, unlike single-node GCN accelerators.
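To make the multicast idea concrete, below is a minimal back-of-envelope simulation of why one topology-aware put per multicast uses far fewer link traversals than sending a separate unicast of a vertex's feature to every accelerator node hosting one of its remote neighbors. This is a sketch under stated assumptions, not the paper's implementation: the unidirectional ring topology, the modulo hash partitioning, and all names (NUM_NODES, home, ring_hops, etc.) are illustrative.

"""Illustrative cost model: per-destination unicasts vs. a single
topology-aware multicast put for distributing vertex features
during GCN aggregation. All parameters and the ring topology are
assumptions for illustration, not MultiGCN's actual design."""
import random

NUM_NODES = 8            # accelerator nodes arranged in a ring (assumed)
NUM_VERTICES = 10_000    # graph vertices, hash-partitioned across nodes
AVG_DEGREE = 16          # neighbors sampled per vertex (assumed)

random.seed(0)
home = {v: v % NUM_NODES for v in range(NUM_VERTICES)}  # vertex -> home node

def ring_hops(src: int, dst: int) -> int:
    """Hop count from src to dst on a unidirectional ring."""
    return (dst - src) % NUM_NODES

unicast_hops = 0
multicast_hops = 0
for v in range(NUM_VERTICES):
    neighbors = random.sample(range(NUM_VERTICES), AVG_DEGREE)
    # Accelerator nodes that host at least one neighbor of v and
    # therefore need v's feature vector for aggregation.
    dests = {home[u] for u in neighbors} - {home[v]}
    if not dests:
        continue
    # Unicast baseline: one end-to-end message per destination node.
    unicast_hops += sum(ring_hops(home[v], d) for d in dests)
    # Topology-aware multicast: a single put travels around the ring;
    # each destination copies the feature as the message passes, so
    # every link up to the farthest destination is used only once.
    multicast_hops += max(ring_hops(home[v], d) for d in dests)

print(f"unicast link traversals:   {unicast_hops}")
print(f"multicast link traversals: {multicast_hops}")
print(f"reduction: {1 - multicast_hops / unicast_hops:.1%}")

Under these assumptions the single put crosses each ring link at most once per feature, so the saving grows with the number of destination nodes per vertex; the 32% transmission reduction reported in the abstract is, of course, specific to the paper's real topology and workloads.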
Pages: 3140-3152
Page count: 13