PSGraph: How Tencent trains extremely large-scale graphs with Spark?

Cited by: 12

Authors:

Jiang, Jiawei [1]

Xiao, Pin [2]

Yu, Lele [2]

Li, Xiaosen [2]

Cheng, Jiefeng [2]

Miao, Xupeng [3]

Zhang, Zhipeng [3]

Cui, Bin [3]

Affiliations:

[1] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland

[2] Tencent Inc, TEG, Data Platform, Shenzhen, Peoples R China

[3] Peking Univ, Sch EECS & MOE, Beijing, Peoples R China

Source:

2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020) | 2020

Keywords:

graph algorithm; Spark; parameter server;

DOI:

10.1109/ICDE48307.2020.00137

Chinese Library Classification (CLC):

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Spark has been extensively used in many applications at Tencent, due to its easy deployment, pipeline capability, and close integration with the Hadoop ecosystem. As the graph computing engine of Spark, GraphX is also widely deployed to process large-scale graph data at Tencent. However, when the graph data reaches billion scale, GraphX encounters serious performance degradation. Worse, GraphX cannot support the emerging graph embedding (GE) and graph neural network (GNN) algorithms. To address these challenges, we develop a new graph processing system, called PSGraph, which uses Spark executors and PyTorch to perform computation and a distributed parameter server to store frequently accessed models. With the parameter server architecture, PSGraph can train on extremely large-scale graph data at Tencent and enables the training of GE and GNN algorithms. Moreover, PSGraph still benefits from the advantages of Spark by staying inside the Spark ecosystem, and can directly replace GraphX without modification to the existing application framework. Our experiments show that PSGraph outperforms GraphX significantly.

Pages: 1549-1557

Number of pages: 9
