PSGraph: How Tencent trains extremely large-scale graphs with Spark?

Cited by: 12

Authors:

Jiang, Jiawei [1]

Xiao, Pin [2]

Yu, Lele [2]

Li, Xiaosen [2]

Cheng, Jiefeng [2]

Miao, Xupeng [3]

Zhang, Zhipeng [3]

Cui, Bin [3]

Affiliations:

[1] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland

[2] Tencent Inc, TEG, Data Platform, Shenzhen, Peoples R China

[3] Peking Univ, Sch EECS & MOE, Beijing, Peoples R China

Source:

2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020) | 2020

Keywords:

graph algorithm; Spark; parameter server

DOI:

10.1109/ICDE48307.2020.00137

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Spark has been used extensively in many applications at Tencent, owing to its easy deployment, pipeline capability, and close integration with the Hadoop ecosystem. As the graph computing engine of Spark, GraphX is also widely deployed to process large-scale graph data at Tencent. However, when the graph data reaches billion scale, GraphX suffers serious performance degradation. Worse, GraphX cannot support the emerging graph embedding (GE) and graph neural network (GNN) algorithms. To address these challenges, we develop a new graph processing system, called PSGraph, which uses Spark executors and PyTorch to perform computation and a distributed parameter server to store frequently accessed models. With the parameter server architecture, PSGraph can train on extremely large-scale graph data at Tencent and enables the training of GE and GNN algorithms. Moreover, PSGraph retains the advantages of Spark by staying inside the Spark ecosystem, and can directly replace GraphX without modification to the existing application framework. Our experiments show that PSGraph outperforms GraphX significantly.
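
The division of labor described in the abstract (workers compute, a parameter server holds the shared model) can be illustrated with a toy, single-process sketch. This is not PSGraph's code or API: `ToyParameterServer`, `pull`, `push_add`, and `pagerank` are hypothetical names chosen for illustration, PageRank is used only as a familiar example algorithm, and the loop over edge partitions stands in for tasks that would run on separate Spark executors. The sketch also assumes every vertex has at least one outgoing edge, and it computes out-degrees locally for brevity.

```python
from collections import defaultdict


class ToyParameterServer:
    """Hypothetical in-process stand-in for a distributed parameter server."""

    def __init__(self):
        self.vectors = {}

    def init_vector(self, name, size, value):
        # Create (or reset) a dense vector stored on the "server".
        self.vectors[name] = [value] * size

    def pull(self, name, indices):
        # Workers fetch only the entries they need.
        vec = self.vectors[name]
        return {i: vec[i] for i in indices}

    def push_add(self, name, updates):
        # Workers send sparse increments; the server accumulates them.
        vec = self.vectors[name]
        for i, delta in updates.items():
            vec[i] += delta


def pagerank(edge_partitions, num_vertices, iterations=20, damping=0.85):
    # Out-degrees are computed in one place here to keep the sketch short.
    out_degree = defaultdict(int)
    for part in edge_partitions:
        for src, _dst in part:
            out_degree[src] += 1

    ps = ToyParameterServer()
    ps.init_vector("rank", num_vertices, 1.0 / num_vertices)

    for _ in range(iterations):
        ps.init_vector("acc", num_vertices, 0.0)
        # Each partition stands in for one worker task.
        for part in edge_partitions:
            sources = {src for src, _dst in part}
            ranks = ps.pull("rank", sources)
            contrib = defaultdict(float)
            for src, dst in part:
                contrib[dst] += ranks[src] / out_degree[src]
            ps.push_add("acc", contrib)
        # Apply damping on the server side to produce the next rank vector.
        acc = ps.vectors["acc"]
        ps.vectors["rank"] = [
            (1.0 - damping) / num_vertices + damping * a for a in acc
        ]
    return ps.vectors["rank"]


# A 3-vertex cycle split across two partitions.
ranks = pagerank([[(0, 1), (1, 2)], [(2, 0)]], num_vertices=3)
```

The point of the sketch is that workers hold only their edge partition and exchange sparse slices of the rank vector with the server, rather than keeping the full vertex state themselves.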


Pages: 1549-1557

Number of pages: 9

50 items in total

[31] Adaptive Partitioning of Large-Scale Dynamic Graphs
Vaquero, Luis M.
Cuadrado, Felix
Logothetis, Dionysios
Martella, Claudio
2014 IEEE 34TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2014), 2014: 144 - 153
[32] Large-scale text processing pipeline with Apache Spark
Svyatkovskiy, A.
Imai, K.
Kroeger, M.
Shiraito, Y.
2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016: 3928 - 3935
[33] Hiding secret messages in large-scale graphs
Lee, Daewon
EXPERT SYSTEMS WITH APPLICATIONS, 2025, 264
[34] Multilevel Parallelism for the Exploration of Large-Scale Graphs
Bernaschi, Massimo
Bisson, Mauro
Mastrostefano, Enrico
Vella, Flavio
IEEE TRANSACTIONS ON MULTI-SCALE COMPUTING SYSTEMS, 2018, 4 (03): 204 - 216
[35] Gaussian Embedding of Large-Scale Attributed Graphs
Hettige, Bhagya
Li, Yuan-Fang
Wang, Weiqing
Buntine, Wray
DATABASES THEORY AND APPLICATIONS, ADC 2020, 2020, 12008 : 134 - 146
[36] Evolution of large-scale magnetosonic structures to trains of solitary waves
Strumik, M.
Stasiewicz, K.
Cheng, C. Z.
Thide, B.
JOURNAL OF GEOPHYSICAL RESEARCH-SPACE PHYSICS, 2011, 116
[37] ANGEL-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent
Nie, Xiaonan
Liu, Yi
Fu, Fangcheng
Xue, Jinbao
Jiao, Dian
Miao, Xupeng
Tao, Yangyu
Cui, Bin
PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 16 (12): 3781 - 3794
[38] Large-scale multi-label ensemble learning on Spark
Gonzalez-Lopez, Jorge
Cano, Alberto
Ventura, Sebastian
2017 16TH IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS / 11TH IEEE INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING / 14TH IEEE INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS, 2017: 893 - 900
[39] Accelerating Large-scale Image Retrieval on Heterogeneous Architectures with Spark
Wang, Hanli
Xiao, Bo
Wang, Lei
Wu, Jun
MM'15: PROCEEDINGS OF THE 2015 ACM MULTIMEDIA CONFERENCE, 2015: 1023 - 1026
[40] Efficient Processing of Recursive Joins on Large-Scale Datasets in Spark
Thuong-Cang Phan
Anh-Cang Phan
Thi-To-Quyen Tran
Ngoan-Thanh Trieu
ADVANCED COMPUTATIONAL METHODS FOR KNOWLEDGE ENGINEERING (ICCSAMA 2019), 2020, 1121 : 391 - 402
