PSGraph: How Tencent trains extremely large-scale graphs with Spark?

Cited by: 12

Authors:

Jiang, Jiawei [1]

Xiao, Pin [2]

Yu, Lele [2]

Li, Xiaosen [2]

Cheng, Jiefeng [2]

Miao, Xupeng [3]

Zhang, Zhipeng [3]

Cui, Bin [3]

Affiliations:

[1] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland

[2] Tencent Inc, TEG, Data Platform, Shenzhen, Peoples R China

[3] Peking Univ, Sch EECS & MOE, Beijing, Peoples R China

Source:

2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020) | 2020

Keywords:

graph algorithm; Spark; parameter server;

DOI:

10.1109/ICDE48307.2020.00137

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Spark has been extensively used in many applications at Tencent, due to its easy deployment, pipeline capability, and close integration with the Hadoop ecosystem. As the graph computing engine of Spark, GraphX is also widely deployed to process large-scale graph data at Tencent. However, when the graph data reaches billion scale, GraphX encounters serious performance degradation. Worse, GraphX cannot support the emerging graph embedding (GE) and graph neural network (GNN) algorithms. To address these challenges, we develop a new graph processing system, called PSGraph, which uses Spark executors and PyTorch to perform computation, and a distributed parameter server to store frequently accessed models. PSGraph can train on extremely large-scale graph data at Tencent with the parameter server architecture, and enables the training of GE and GNN algorithms. Moreover, PSGraph still benefits from the advantages of Spark by staying inside the Spark ecosystem, and can directly replace GraphX without modification to the existing application framework. Our experiments show that PSGraph outperforms GraphX significantly.

Citation

Pages: 1549-1557

Page count: 9

