An Algorithm Based on Dataflow Model for Mining Frequent Patterns from a Large Graph

被引:0
作者
Tang X.-C. [1 ]
Fan X.-F. [1 ]
Zhou J.-W. [1 ]
Li Z.-H. [1 ]
机构
[1] School of Computer Science, Northwestern Polytechnical University, Xi'an
来源
Jisuanji Xuebao/Chinese Journal of Computers | 2020年 / 43卷 / 07期
关键词
Coding tree; Dataflow model; Frequent pattern; Graph mining; Parallel algorithm;
D O I
10.11897/SP.J.1016.2020.01293
中图分类号
学科分类号
摘要
Big graph data mining has been highly motivated not only by the tremendously increasing size of graphs but also by its large number of applications, such as bioinformatics, chemoinformatics, and social networks. One of the most challenging tasks in big graph mining is pattern mining. These tasks consist on using data mining algorithms to discover interesting, unexpected and useful patterns in large amounts of graph data. Several algorithms exist for frequent pattern mining, but they are mainly used on centralized computing systems and evaluated on relatively small datasets. While modern graphs are growing dramatically, several parallel and distributed solutions have been proposed to solve this problem. However, those methods do not have better performance in scalability and balancing. So that we propose an algorithm based on dataflow model for mining frequent patterns in a large single graph. We construct a dataflow model for Mining frequent patterns, which include three operators: IsFrequent, Expand and Code. At first, the frequent pattern mining method based on dataflow model separates large graph into many micro graphs and has following advantages. These micro graphs can be expanded and calculated simultaneously, because they are independent of each other. At the same time, since each iteration is based on the subgraph instance generated in the previous iteration, only one vertex or one edge needs to be extended, it decreases the generation of redundant subgraph in Expand operator. Secondly, we propose a regular code computing strategy based on invariant relation and an optimization strategy based on coding tree. These two approaches solve the problem that it is difficult to calculate the regular code. The results show that our regular code computing strategy improves performance by 30% over the original approach and our optimization strategy improves performance by 10% over the original strategy. Thirdly, we design operators of checking frequent pattern using micro batch data. After the large batch data is decomposed into multiple micro batch data, each micro batch data can be regarded as a single processing unit, a lot of tasks can be generated concurrently, which reduce data skew. These micro batch data can be iteratively computed more easily in parallel computing. And its iterative approach also satisfies the anti-monotonicity of frequent patterns mining. At last, the algorithm of frequent pattern mining is implemented. The experiments on our cluster show that the algorithm can effectively process a variety of large graphs with millions of vertices and tens of millions of frequent pattern mining, and scales well with the degree of available parallelism. © 2020, Science Press. All right reserved.
引用
收藏
页码:1293 / 1311
页数:18
相关论文
共 38 条
[11]  
Yan X, Han J., gSpan: Graph-based substructure pattern mining, Proceedings of the IEEE International Conference on Data Mining, (2002)
[12]  
Liu Y, Jiang X, Chen H, Et al., MapReduce-based pattern finding algorithm applied in motif detection for prescription compatibility network, Proceedings of the Advanced Parallel Processing Technologies, pp. 24-25, (2009)
[13]  
Shahrivari S, Jalili S., Distributed discovery of frequent subgraphs of a network using MapReduce, Computing, 97, 11, pp. 1101-1120, (2015)
[14]  
Kang U, Tsourakakis C E, Faloutsos C., PEGASUS: Mining peta-scale graphs, Knowledge & Information Systems, 27, 2, pp. 303-325, (2011)
[15]  
Wernicke S, Rasche F., FANMOD: A tool for fast network motif detection, Bioinformatics, 22, 9, pp. 1152-1153, (2006)
[16]  
Jiang C, Coenen F, Zito M., A survey of frequent subgraph mining algorithms, Knowledge Engineering Review, 28, 1, pp. 75-105, (2013)
[17]  
Suryawanshi S J, Kamalapur S M., Algorithms for frequent subgraph mining, International Journal Advanced Research in Computer and Communication Engineering, 2, 3, pp. 1545-1548, (2013)
[18]  
Agrawal R, Srikant R., Fast algorithms for mining association rules in large databases, Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499, (1994)
[19]  
Abdelhamid E, Abdelaziz I, Kalnis P, Et al., Scalemine: Scalable parallel frequent subgraph mining in a single large graph, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (2016)
[20]  
Di Fatta G, Berthold M R., Dynamic load balancing for the distributed mining of molecular structures, IEEE Transactions on Parallel and Distributed Systems, 17, 8, pp. 773-785, (2006)