A new algorithm for fast mining frequent itemsets using N-lists

被引：104

作者：

Deng ZhiHong ^{[1
]}

Wang ZhongHui ^{[1
]}

Jiang JiaJian ^{[1
]}

机构：

[1] Peking Univ, Sch Elect Engn & Comp Sci, Minist Educ, Key Lab Machine Percept, Beijing 100871, Peoples R China

来源：

SCIENCE CHINA-INFORMATION SCIENCES | 2012年 / 55卷 / 09期

基金：

中国国家自然科学基金; 国家高技术研究发展计划(863计划);

关键词：

data mining; frequent itemset mining; data structure; N-lists; algorithm; PATTERNS; UTILITIES; SUPPORT;

D O I：

10.1007/s11432-012-4638-z

中图分类号：

TP [自动化技术、计算机技术];

学科分类号：

0812 ;

摘要：

Mining frequent itemsets has emerged as a fundamental problem in data mining and plays an essential role in many important data mining tasks. In this paper, we propose a novel vertical data representation called N-list, which originates from an FP-tree-like coding prefix tree called PPC-tree that stores crucial information about frequent itemsets. Based on the N-list data structure, we develop an efficient mining algorithm, PrePost, for mining all frequent itemsets. Efficiency of PrePost is achieved by the following three reasons. First, N-list is compact since transactions with common prefixes share the same nodes of the PPC-tree. Second, the counting of itemsets' supports is transformed into the intersection of N-lists and the complexity of intersecting two N-lists can be reduced to O(m + n) by an efficient strategy, where m and n are the cardinalities of the two N-lists respectively. Third, PrePost can directly find frequent itemsets without generating candidate itemsets in some cases by making use of the single path property of N-list. We have experimentally evaluated PrePost against four state-of-the-art algorithms for mining frequent itemsets on a variety of real and synthetic datasets. The experimental results show that the PrePost algorithm is the fastest in most cases. Even though the algorithm consumes more memory when the datasets are sparse, it is still the fastest one.

引用

页码：2008 / 2030

页数：23

共 31 条

[1]

Agrawal R., 1993, SIGMOD Record, V22, P207, DOI 10.1145/170036.170072

[2]

Agrawal R, 2003, 11 INT C DAT ENG ICD, P3

[3]

Agrawal R., 1994, P 20 INT C VER LARG, P487, DOI DOI 10.5555/645920.672836

[4] An incremental data mining algorithm for compact realization of prototypes [J].

Ananthanarayana, VS ;

Murty, MN ;

Subramanian, DK .

PATTERN RECOGNITION, 2001, 34 (11) :2249-2251

[5]

[Anonymous], 2005, Algorithm Design

[6] IMine: Index Support for Item Set Mining [J].

Baralis, Elena ;

Cerquitelli, Tania ;

Chiusano, Silvia .

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2009, 21 (04) :493-506

[7]

BAYARDO RJ, 1998, SIGMOD 98, P85, DOI DOI 10.1145/276304.276313

[8] MAFIA: A maximal frequent itemset algorithm [J].

Burdick, D ;

Calimlim, M ;

Flannick, J ;

Gehrke, J ;

Yiu, TM .

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2005, 17 (11) :1490-1504

[9] Mining impact-targeted activity patterns in imbalanced data [J].

Cao, Longbing ;

Zhao, Yanchang ;

Zhang, Chengqi .

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2008, 20 (08) :1053-1066

[10] Combined Mining: Discovering Informative Knowledge in Complex Data [J].

Cao, Longbing ;

Zhang, Huaifeng ;

Zhao, Yanchang ;

Luo, Dan ;

Zhang, Chengqi .

IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS, 2011, 41 (03) :699-712

← 1 2 3 4 →