Approximate Parallel High Utility Itemset Mining

被引:35
作者
Chen, Yan [1 ]
An, Aijun [1 ]
机构
[1] York Univ, Dept Elect Engn & Comp Sci, Toronto, ON, Canada
关键词
Approximate; Parallel; Hadoop; Spark; HUI; FREQUENT ITEMSETS;
D O I
10.1016/j.bdr.2016.07.001
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
High utility itemset mining discovers itemsets whose utility is above a given threshold, where the utility measures the importance of an itemset. It overcomes the limitation of frequent pattern mining, which uses frequency as its quality measure. To speed up the performance for mining high utility itemsets, many algorithms have been proposed which usually focus on optimizing the candidate generation process. However, memory and time performance limitations still cause scalability issues, especially when the dataset is very large. In this paper, the problem is addressed by proposing a distributed parallel algorithm, PHUI-Miner, and a sampling strategy, which can be used either separately or simultaneously. PHUI-Miner parallelizes the state-of-the-art high utility itemset mining algorithm HUI-Miner. In PHUI-Miner, the search space of the high utility itemset mining problem is divided and assigned to nodes in a cluster, which splits the workload. The sampling strategy investigates the required sample size of a dataset, in order to achieve a given accuracy. The sample size is selected based on a new theorem, which provides a theoretical guarantee on the accuracy of results. We also propose an approach combining sampling with PHUI-Miner, which mines an approximate set of results, but could provide better time performance. In our experiments, we show that PHUI-Miner has high performance on different datasets and outperforms the state-of-the-art non-parallel algorithm HUI-Miner. The sampling strategy achieves accuracies much higher than the guarantee provided by the theorems in practice. Extensive experiments are also conducted to compare the time performance of PHUI-Miner with and without sampling. (C) 2016 Elsevier Inc. All rights reserved.
引用
收藏
页码:26 / 42
页数:17
相关论文
共 45 条
[1]  
Ada Wai-Chee Fu, 2000, Foundations of Intelligent Systems. 12th International Symposium, ISMIS 2000. Proceedings (Lecture Notes in Artificial Intelligence Vol.1932), P59
[2]  
Agrawal R., 1993, SIGMOD Record, V22, P207, DOI 10.1145/170036.170072
[3]  
Agrawal R., P 20 INT C VERY LARG
[4]   HUC-Prune: an efficient candidate pruning technique to mine high utility patterns [J].
Ahmed, Chowdhury Farhan ;
Tanbeer, Syed Khairuzzaman ;
Jeong, Byeong-Soo ;
Lee, Young-Koo .
APPLIED INTELLIGENCE, 2011, 34 (02) :181-198
[5]   Efficient Tree Structures for High Utility Pattern Mining in Incremental Databases [J].
Ahmed, Chowdhury Farhan ;
Tanbeer, Syed Khairuzzaman ;
Jeong, Byeong-Soo ;
Lee, Young-Koo .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2009, 21 (12) :1708-1721
[6]  
[Anonymous], TECHNICAL REPORTS CI
[7]  
[Anonymous], 1994, KDD
[8]  
[Anonymous], 2012, NSDI
[9]   Extracting share frequent itemsets with infrequent subsets [J].
Barber, B ;
Hamilton, HJ .
DATA MINING AND KNOWLEDGE DISCOVERY, 2003, 7 (02) :153-185
[10]  
Bayardo R. J. Jr., 1998, SIGMOD Record, V27, P85, DOI 10.1145/276305.276313