PSGraph: How Tencent trains extremely large-scale graphs with Spark?

Cited by: 12
Authors
Jiang, Jiawei [1]
Xiao, Pin [2]
Yu, Lele [2]
Li, Xiaosen [2]
Cheng, Jiefeng [2]
Miao, Xupeng [3]
Zhang, Zhipeng [3]
Cui, Bin [3]
Affiliations
[1] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland
[2] Tencent Inc, TEG, Data Platform, Shenzhen, Peoples R China
[3] Peking Univ, Sch EECS & MOE, Beijing, Peoples R China
Keywords
graph algorithm; Spark; parameter server
DOI
10.1109/ICDE48307.2020.00137
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Spark has been used extensively in many applications at Tencent, owing to its easy deployment, pipeline capability, and close integration with the Hadoop ecosystem. As the graph computing engine of Spark, GraphX is also widely deployed to process large-scale graph data at Tencent. However, when the graph data reaches billion scale, GraphX suffers serious performance degradation. Worse, GraphX cannot support the rapidly advancing graph embedding (GE) and graph neural network (GNN) algorithms. To address these challenges, we develop a new graph processing system, called PSGraph, which uses Spark executors and PyTorch to perform computation and provides a distributed parameter server to store frequently accessed models. With the parameter server architecture, PSGraph can train on extremely large-scale graph data at Tencent and enables the training of GE and GNN algorithms. Moreover, PSGraph retains the advantages of Spark by staying inside the Spark ecosystem, and can directly replace GraphX without modification to the existing application framework. Our experiments show that PSGraph significantly outperforms GraphX.
Pages: 1549-1557
Number of pages: 9