Towards Online Graph Processing with Spark Streaming

Cited by: 0
Authors
Abughofa, Tariq [1]
Zulkernine, Farhana [1]
Affiliations
[1] Queens Univ, Sch Comp, Kingston, ON, Canada
Source
2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2017
Keywords
Real-time processing; Spark; RDD; IndexedRDD; Redis; Graph processing; Stream processing;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Graph processing is one of the most important topics in big data processing. The graph architecture is suitable for distributed processing because the processing works iteratively, which allows parallelism. The structure has also proved suitable for representing social networks, web page indexes, and many other problems. However, graph processing introduces many problems as well. Partitioning the graph to distribute the data across multiple machines while minimizing data movement is a serious challenge, and many graph algorithms have high complexity. GraphX is one of the frameworks that introduce a graph abstraction on top of Spark, an iterative data processing engine. However, GraphX and other novel graph abstractions still do not support processing data streams with online graphs. In this work we use IndexedRDD, a library that enables fine-grained updates as a key-value store on top of Spark, to represent a graph structure and test whether it can serve as efficient online graph storage for Spark Streaming. We ran experiments to compare our data streaming implementation using IndexedRDD with the obvious elementary solution of using RDD transformations to join the old RDD with the new one, producing a new composite RDD on each micro-batch. We also compare these two approaches with a distributed in-memory key-value store (such as Redis). The results show a big advantage for Redis over RDD transformations and IndexedRDD. However, Redis has some limitations, such as lacking support for property graphs. IndexedRDD, on the other hand, showed good performance for insertions but has a shortcoming: it needs to rebuild the index after each data update, which adds extra time to each lookup and cannot be tolerated when lookup speed is essential.
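
The contrast the abstract draws between rebuilding a composite dataset on each micro-batch and updating a key-value store in place can be illustrated with a minimal sketch. This is plain Python with no Spark or Redis dependency: the dictionaries only stand in for an RDD and an external store, and the edge batches, function names, and variable names are all hypothetical and not taken from the paper.

```python
from collections import defaultdict

# Hypothetical micro-batches of directed edges (src, dst); not data from the paper.
batches = [
    [("a", "b"), ("a", "c")],
    [("b", "c"), ("a", "d")],
]


def rebuild_per_batch(old_graph, batch):
    """Immutable style, loosely analogous to joining the old RDD with the
    new one: every batch yields a fresh composite adjacency map."""
    # Copy the whole structure first, so the cost grows with graph size.
    new_graph = {node: set(neighbours) for node, neighbours in old_graph.items()}
    for src, dst in batch:
        new_graph.setdefault(src, set()).add(dst)
    return new_graph


def update_in_place(store, batch):
    """Mutable key-value style, loosely analogous to an external in-memory
    store: only the keys touched by the batch are modified."""
    for src, dst in batch:
        store[src].add(dst)


immutable_graph = {}
kv_store = defaultdict(set)
for batch in batches:
    immutable_graph = rebuild_per_batch(immutable_graph, batch)
    update_in_place(kv_store, batch)

# Both strategies end with the same adjacency lists; they differ in how
# much work each micro-batch has to do to get there.
print(sorted(immutable_graph["a"]))  # ['b', 'c', 'd']
```

The sketch only shows why per-batch rebuilding scales with the size of the whole graph while in-place updates scale with the size of the batch; it says nothing about the actual timings reported in the paper.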
Pages: 2787-2794
Page count: 8