Practical Near-Data-Processing Architecture for Large-Scale Distributed Graph Neural Network

Times cited: 2
Authors
Huang, Linyong [1 ]
Zhang, Zhe [2 ]
Li, Shuangchen [2 ]
Niu, Dimin [2 ]
Guan, Yijin [2 ]
Zheng, Hongzhong [2 ]
Xie, Yuan [2 ]
Affiliations
[1] Zhejiang Univ, Coll Informat Sci & Elect Engn, Hangzhou 310058, Peoples R China
[2] Alibaba Grp, Hangzhou 311121, Peoples R China
Keywords
Graph neural network; large-scale graph processing; memory pool; near data processing;
DOI
10.1109/ACCESS.2022.3169423
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
Graph neural networks (GNNs) have drawn tremendous attention in the past few years due to their convincing performance and high interpretability in various graph-based tasks such as link prediction and node classification. With ever-growing graph sizes in the real world, especially industrial graphs at the billion scale, graph storage can easily consume terabytes, so GNNs have to be processed in a distributed manner. As a result, execution can be inefficient due to expensive cross-node communication and irregular memory access. Various GNN accelerators have been proposed for efficient GNN processing, but they mainly target small and medium-size graphs and are not applicable to large-scale distributed graphs. In this paper, we present a practical Near-Data-Processing (NDP) architecture based on a memory-pool system for large-scale distributed GNNs. We propose a customized memory fabric interface to construct the memory pool for low-latency, high-throughput cross-node communication, providing flexible memory allocation and strong scalability. A practical NDP design is proposed for efficient work offloading and improved bandwidth utilization. Moreover, we introduce a partition and scheduling scheme to further improve performance and achieve workload balance. Comprehensive evaluations demonstrate that the proposed architecture achieves up to 27x and 8x higher training speed than two state-of-the-art distributed GNN frameworks, Deep Graph Library and P3, respectively.
Pages: 46796-46807
Number of pages: 12
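
To make the abstract's partition-and-aggregate idea concrete, the following is a minimal, hypothetical Python sketch, not taken from the paper: it hash-partitions a random graph, runs one round of mean neighbor aggregation (the gather step that near-data-processing logic attached to each memory partition could offload), and counts cross-partition edges as a rough proxy for the traffic the memory-pool fabric would carry. All names, sizes, and the hash partitioning below are illustrative assumptions, not the paper's partition and scheduling scheme.

    # Hypothetical sketch, not the paper's implementation.
    import numpy as np

    rng = np.random.default_rng(0)
    num_nodes, feat_dim, num_parts = 1000, 16, 4  # illustrative sizes

    # A random directed edge list and node features stand in for a real graph.
    edges = rng.integers(0, num_nodes, size=(5000, 2))
    features = rng.standard_normal((num_nodes, feat_dim)).astype(np.float32)

    # Naive hash partition of nodes; the paper's dedicated partition and
    # scheduling scheme is not reproduced here.
    part_of = np.arange(num_nodes) % num_parts

    # One round of mean neighbor aggregation (the GNN "gather" step).
    # In an NDP design, each partition's share of this loop would run close
    # to the memory holding that partition's features.
    aggregated = np.zeros_like(features)
    degree = np.zeros(num_nodes, dtype=np.float32)
    for src, dst in edges:
        aggregated[dst] += features[src]
        degree[dst] += 1.0
    aggregated /= np.maximum(degree, 1.0)[:, None]

    # Cross-partition edges approximate the remote traffic that a low-latency
    # memory fabric between nodes would have to absorb.
    cross = int(np.sum(part_of[edges[:, 0]] != part_of[edges[:, 1]]))
    print("cross-partition edges:", cross, "of", len(edges))

A better partitioner (for example, one that minimizes edge cut while balancing per-partition work) would reduce the cross-partition count printed above, which is the kind of communication and load-balance trade-off the paper's partition and scheduling scheme targets.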