A new algorithm for fast mining frequent itemsets using N-lists

被引:100
作者
Deng ZhiHong [1 ]
Wang ZhongHui [1 ]
Jiang JiaJian [1 ]
机构
[1] Peking Univ, Sch Elect Engn & Comp Sci, Minist Educ, Key Lab Machine Percept, Beijing 100871, Peoples R China
基金
中国国家自然科学基金; 国家高技术研究发展计划(863计划);
关键词
data mining; frequent itemset mining; data structure; N-lists; algorithm; PATTERNS; UTILITIES; SUPPORT;
D O I
10.1007/s11432-012-4638-z
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Mining frequent itemsets has emerged as a fundamental problem in data mining and plays an essential role in many important data mining tasks. In this paper, we propose a novel vertical data representation called N-list, which originates from an FP-tree-like coding prefix tree called PPC-tree that stores crucial information about frequent itemsets. Based on the N-list data structure, we develop an efficient mining algorithm, PrePost, for mining all frequent itemsets. Efficiency of PrePost is achieved by the following three reasons. First, N-list is compact since transactions with common prefixes share the same nodes of the PPC-tree. Second, the counting of itemsets' supports is transformed into the intersection of N-lists and the complexity of intersecting two N-lists can be reduced to O(m + n) by an efficient strategy, where m and n are the cardinalities of the two N-lists respectively. Third, PrePost can directly find frequent itemsets without generating candidate itemsets in some cases by making use of the single path property of N-list. We have experimentally evaluated PrePost against four state-of-the-art algorithms for mining frequent itemsets on a variety of real and synthetic datasets. The experimental results show that the PrePost algorithm is the fastest in most cases. Even though the algorithm consumes more memory when the datasets are sparse, it is still the fastest one.
引用
收藏
页码:2008 / 2030
页数:23
相关论文
共 31 条
  • [1] Agrawal R., 1993, SIGMOD Record, V22, P207, DOI 10.1145/170036.170072
  • [2] Agrawal R, 2003, 11 INT C DAT ENG ICD, P3
  • [3] Agrawal R., 1994, P 20 INT C VER LARG, P487, DOI DOI 10.5555/645920.672836
  • [4] An incremental data mining algorithm for compact realization of prototypes
    Ananthanarayana, VS
    Murty, MN
    Subramanian, DK
    [J]. PATTERN RECOGNITION, 2001, 34 (11) : 2249 - 2251
  • [5] [Anonymous], 2005, Algorithm Design
  • [6] IMine: Index Support for Item Set Mining
    Baralis, Elena
    Cerquitelli, Tania
    Chiusano, Silvia
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2009, 21 (04) : 493 - 506
  • [7] BAYARDO RJ, 1998, SIGMOD 98, P85, DOI DOI 10.1145/276304.276313
  • [8] MAFIA: A maximal frequent itemset algorithm
    Burdick, D
    Calimlim, M
    Flannick, J
    Gehrke, J
    Yiu, TM
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2005, 17 (11) : 1490 - 1504
  • [9] Mining impact-targeted activity patterns in imbalanced data
    Cao, Longbing
    Zhao, Yanchang
    Zhang, Chengqi
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2008, 20 (08) : 1053 - 1066
  • [10] Combined Mining: Discovering Informative Knowledge in Complex Data
    Cao, Longbing
    Zhang, Huaifeng
    Zhao, Yanchang
    Luo, Dan
    Zhang, Chengqi
    [J]. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS, 2011, 41 (03): : 699 - 712