Towards Online Graph Processing with Spark Streaming

Cited by: 0

Authors:

Abughofa, Tariq [1]

Zulkernine, Farhana [1]

Affiliations:

[1] Queens Univ, Sch Comp, Kingston, ON, Canada

Source:

2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2017

Keywords:

Real-time processing; Spark; RDD; IndexedRDD; Redis; Graph processing; Stream processing;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Graph processing is one of the most important topics in big data processing. The graph architecture is suitable for distributed processing because the processing works iteratively, which allows parallelism. The structure has also proved suitable for representing social networks, web page indexes, and many other problems. However, graph processing introduces many problems as well. Partitioning the graph to distribute the data across multiple machines while minimizing data movement is a serious challenge, and many graph algorithms have high complexity. GraphX is one of the frameworks that introduce a graph abstraction on top of Spark, an iterative data processing engine. However, GraphX and other novel graph abstractions still do not support processing data streams with online graphs. In this work we use IndexedRDD, a library that enables fine-grained updates as a key-value store on top of Spark, to represent a graph structure and test whether it can serve as an efficient online graph storage for Spark Streaming. We ran experiments to compare our data streaming implementation using IndexedRDD with the obvious elementary solution of using RDD transformations to join the old RDD with the new one to make a new composite RDD on each micro-batch. We also compare these two approaches with a distributed in-memory key-value store (such as Redis). The results show a big advantage of using Redis over RDD transformations and IndexedRDD. However, Redis has some limitations, such as lacking support for property graphs. IndexedRDD, on the other hand, showed good performance for insertions but has a shortcoming in its need to rebuild the index after each data update, which adds extra time to each lookup that cannot be tolerated when lookup speed is essential.
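The contrast the abstract draws between rebuilding a composite structure on every micro-batch and applying fine-grained updates to a key-value store can be illustrated with a minimal sketch. It is not the authors' implementation and uses neither Spark, IndexedRDD, nor Redis: plain Python dictionaries stand in for both, and every name in it (`merge_batch_immutable`, `InPlaceGraphStore`, the sample batches) is hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Edge = Tuple[str, str]
Adjacency = Dict[str, FrozenSet[str]]


def merge_batch_immutable(old: Adjacency, batch: Iterable[Edge]) -> Adjacency:
    """Stand-in for the 'join old RDD with new RDD' baseline.

    A brand-new adjacency map is built on every micro-batch; the old
    one is left untouched, as an immutable dataset would be.
    """
    new_edges: Dict[str, Set[str]] = defaultdict(set)
    for src, dst in batch:
        new_edges[src].add(dst)
    merged: Adjacency = {}
    # Every vertex is revisited, even those the batch did not touch.
    for vertex in old.keys() | new_edges.keys():
        merged[vertex] = old.get(vertex, frozenset()) | frozenset(
            new_edges.get(vertex, ())
        )
    return merged


class InPlaceGraphStore:
    """Stand-in for a mutable key-value store holding adjacency sets.

    Each edge is a fine-grained update; nothing is rebuilt.
    """

    def __init__(self) -> None:
        self._adj: Dict[str, Set[str]] = defaultdict(set)

    def add_edges(self, batch: Iterable[Edge]) -> None:
        for src, dst in batch:
            self._adj[src].add(dst)

    def neighbours(self, vertex: str) -> FrozenSet[str]:
        return frozenset(self._adj.get(vertex, ()))


# Two hypothetical micro-batches of directed edges.
batches = [[("a", "b"), ("a", "c")], [("b", "c"), ("a", "d")]]

snapshot: Adjacency = {}
store = InPlaceGraphStore()
for batch in batches:
    snapshot = merge_batch_immutable(snapshot, batch)  # full rebuild
    store.add_edges(batch)                             # incremental update
```

Both approaches end with the same adjacency information; they differ in how much work each micro-batch triggers. The rebuild touches every vertex seen so far, while the in-place store touches only the vertices named in the batch.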

Pages: 2787-2794

Page count: 8
