Multi-Node Acceleration for Large-Scale GCNs

Cited by: 4
Authors
Sun, Gongjian [1 ,2 ]
Yan, Mingyu [1 ,2 ]
Wang, Duo [1 ,2 ]
Li, Han [1 ,2 ]
Li, Wenming [1 ,2 ]
Ye, Xiaochun [1 ,2 ]
Fan, Dongrui [1 ,2 ]
Xie, Yuan [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing 100045, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 101408, Peoples R China
[3] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; graph neural network; hardware accelerator; multi-node system; communication optimization;
DOI
10.1109/TC.2022.3207127
CLC classification
TP3 [Computing technology, computer technology];
Subject classification
0812;
Abstract
Limited by memory capacity and compute power, single-node graph convolutional neural network (GCN) accelerators cannot complete the execution of GCNs within a reasonable amount of time, given the explosive growth in graph sizes. Large-scale GCNs therefore call for a multi-node acceleration system (MultiAccSys), analogous to the tensor processing unit (TPU) Pod for large-scale neural networks. In this work, we aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale graphs. We first identify the communication patterns and challenges of multi-node acceleration for GCNs on large-scale graphs. We observe that (1) irregular coarse-grained communication patterns exist in the execution of GCNs in a MultiAccSys, introducing a massive amount of redundant network transmissions and off-chip memory accesses; and (2) the acceleration of GCNs in a MultiAccSys is mainly bounded by network bandwidth but tolerates network latency. Guided by these observations, we propose MultiGCN, an efficient MultiAccSys for large-scale GCNs that trades network latency for network bandwidth. Specifically, leveraging the network latency tolerance, we first propose a topology-aware multicast mechanism with a one-put-per-multicast message-passing model to reduce transmissions and alleviate network bandwidth requirements. Second, we introduce a scatter-based round execution mechanism that cooperates with the multicast mechanism and reduces redundant off-chip memory accesses. Compared to the baseline MultiAccSys, MultiGCN achieves a 4~12x speedup using only 28%~68% of the energy, while reducing transmissions by 32% and off-chip memory accesses by 73% on average. Moreover, MultiGCN not only achieves a 2.5~8x speedup over the state-of-the-art multi-GPU solution, but also scales to large-scale graphs, unlike single-node GCN accelerators.
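To make the multicast idea concrete, below is a minimal back-of-envelope simulation of why one topology-aware put per multicast uses far fewer link traversals than sending a separate unicast of a vertex's feature to every accelerator node hosting one of its remote neighbors. This is a sketch under stated assumptions, not the paper's implementation: the unidirectional ring topology, the modulo hash partitioning, and all names (NUM_NODES, home, ring_hops, etc.) are illustrative.

"""Illustrative cost model: per-destination unicasts vs. a single
topology-aware multicast put for distributing vertex features
during GCN aggregation. All parameters and the ring topology are
assumptions for illustration, not MultiGCN's actual design."""
import random

NUM_NODES = 8            # accelerator nodes arranged in a ring (assumed)
NUM_VERTICES = 10_000    # graph vertices, hash-partitioned across nodes
AVG_DEGREE = 16          # neighbors sampled per vertex (assumed)

random.seed(0)
home = {v: v % NUM_NODES for v in range(NUM_VERTICES)}  # vertex -> home node

def ring_hops(src: int, dst: int) -> int:
    """Hop count from src to dst on a unidirectional ring."""
    return (dst - src) % NUM_NODES

unicast_hops = 0
multicast_hops = 0
for v in range(NUM_VERTICES):
    neighbors = random.sample(range(NUM_VERTICES), AVG_DEGREE)
    # Accelerator nodes that host at least one neighbor of v and
    # therefore need v's feature vector for aggregation.
    dests = {home[u] for u in neighbors} - {home[v]}
    if not dests:
        continue
    # Unicast baseline: one end-to-end message per destination node.
    unicast_hops += sum(ring_hops(home[v], d) for d in dests)
    # Topology-aware multicast: a single put travels around the ring;
    # each destination copies the feature as the message passes, so
    # every link up to the farthest destination is used only once.
    multicast_hops += max(ring_hops(home[v], d) for d in dests)

print(f"unicast link traversals:   {unicast_hops}")
print(f"multicast link traversals: {multicast_hops}")
print(f"reduction: {1 - multicast_hops / unicast_hops:.1%}")

Under these assumptions the single put crosses each ring link at most once per feature, so the saving grows with the number of destination nodes per vertex; the 32% transmission reduction reported in the abstract is, of course, specific to the paper's real topology and workloads.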
Pages: 3140-3152
Page count: 13