An Algorithm Based on Dataflow Model for Mining Frequent Patterns from a Large Graph

被引:0
作者
Tang X.-C. [1 ]
Fan X.-F. [1 ]
Zhou J.-W. [1 ]
Li Z.-H. [1 ]
机构
[1] School of Computer Science, Northwestern Polytechnical University, Xi'an
来源
Jisuanji Xuebao/Chinese Journal of Computers | 2020年 / 43卷 / 07期
关键词
Coding tree; Dataflow model; Frequent pattern; Graph mining; Parallel algorithm;
D O I
10.11897/SP.J.1016.2020.01293
中图分类号
学科分类号
摘要
Big graph data mining has been highly motivated not only by the tremendously increasing size of graphs but also by its large number of applications, such as bioinformatics, chemoinformatics, and social networks. One of the most challenging tasks in big graph mining is pattern mining. These tasks consist on using data mining algorithms to discover interesting, unexpected and useful patterns in large amounts of graph data. Several algorithms exist for frequent pattern mining, but they are mainly used on centralized computing systems and evaluated on relatively small datasets. While modern graphs are growing dramatically, several parallel and distributed solutions have been proposed to solve this problem. However, those methods do not have better performance in scalability and balancing. So that we propose an algorithm based on dataflow model for mining frequent patterns in a large single graph. We construct a dataflow model for Mining frequent patterns, which include three operators: IsFrequent, Expand and Code. At first, the frequent pattern mining method based on dataflow model separates large graph into many micro graphs and has following advantages. These micro graphs can be expanded and calculated simultaneously, because they are independent of each other. At the same time, since each iteration is based on the subgraph instance generated in the previous iteration, only one vertex or one edge needs to be extended, it decreases the generation of redundant subgraph in Expand operator. Secondly, we propose a regular code computing strategy based on invariant relation and an optimization strategy based on coding tree. These two approaches solve the problem that it is difficult to calculate the regular code. The results show that our regular code computing strategy improves performance by 30% over the original approach and our optimization strategy improves performance by 10% over the original strategy. Thirdly, we design operators of checking frequent pattern using micro batch data. After the large batch data is decomposed into multiple micro batch data, each micro batch data can be regarded as a single processing unit, a lot of tasks can be generated concurrently, which reduce data skew. These micro batch data can be iteratively computed more easily in parallel computing. And its iterative approach also satisfies the anti-monotonicity of frequent patterns mining. At last, the algorithm of frequent pattern mining is implemented. The experiments on our cluster show that the algorithm can effectively process a variety of large graphs with millions of vertices and tens of millions of frequent pattern mining, and scales well with the degree of available parallelism. © 2020, Science Press. All right reserved.
引用
收藏
页码:1293 / 1311
页数:18
相关论文
共 38 条
[1]  
Deshpande M, Kuramochi M, Karypis G., Frequent sub-structure-based approaches for classifying chemical compounds, Proceedings of the IEEE International Conference on Data Mining, pp. 35-42, (2003)
[2]  
Domshlak C, Brafman R I, Shimony E., Preference-based configuration of Web page content, Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1451-1456, (2001)
[3]  
Guralnik V, Karypis G., A scalable algorithm for clustering sequential data, Proceedings of the IEEE International Conference on Data Mining, pp. 179-186, (2001)
[4]  
Deutsch A, Fernandez M, Suciu D., Storing semistructured data with STORED, Sigmod Record, 28, 2, pp. 431-442, (1999)
[5]  
Yan X, Yu P S, Han J., Graph indexing: A frequent structure-based approach, Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 335-346, (2004)
[6]  
Cho Y R, Zhang A., Predicting protein function by frequent functional association pattern mining in protein interaction networks, IEEE Transactions on Information Technology in Biomedicine, 14, 1, pp. 30-36, (2010)
[7]  
Aridhi S, Nguifo E M., Big graph mining: Frameworks and techniques, Big Data Research, 6, pp. 1-10, (2016)
[8]  
Kuramochi M, Karypis G., Frequent subgraph discovery, Proceedings of the IEEE International Conference on Data Mining, pp. 313-320, (2002)
[9]  
Kuramochi M, Karypis G., Grew-A scalable frequent subgraph discovery algorithm, Proceedings of the 4th IEEE International Conference on Data Mining (ICDM'04), pp. 439-442, (2004)
[10]  
Inokuchi A, Washio T, Motoda H., Complete mining of frequent patterns from graphs: Mining graph data, Machine Learning, 50, 3, pp. 321-354, (2003)