Pincer-search: An efficient algorithm for discovering the maximum frequent set

被引：51

作者：

Lin, DI

Kedem, ZM

机构：

[1] Redback Networks Inc, San Jose, CA 95134 USA

[2] NYU, Dept Comp Sci, New York, NY 10012 USA

来源：

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2002年 / 14卷 / 03期

基金：

美国国家科学基金会;

关键词：

data mining; knowledge discovery; association rule; maximum frequent set; Pincer Search; maximum frequent candidate set;

D O I：

10.1109/TKDE.2002.1000342

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Discovering frequent itemsets is a key problem in important data mining applications, such as the discovery of association rules, strong rules, episodes, and minimal keys. Typical algorithms for solving this problem operate in a bottom-up, breadth-first search direction. The computation starts from frequent 1-itemsets (the minimum length frequent itemsets) and continues until all maximal (length) frequent itemsets are found. During the execution, every frequent itemset is explicitly considered. Such algorithms perform well when all maximal frequent itemsets are short. However, performance drastically deteriorates when some of the maximal frequent itemsets are long. We present a new algorithm which combines both the bottom-up and the top-down searches. The primary search direction is still bottom-up, but a restricted search is also conducted in the top-down direction. This search is used only for maintaining and updating a new data structure, the maximum frequent candidate set. It is used to prune early candidates that would be normally encountered in the bottom-up search. A very important characteristic of the algorithm is that it does not require explicit examination of every frequent itemset. Therefore, the algorithm performs well even when some maximal frequent itemsets are long. As its output, the algorithm produces the maximum frequent set, i.e., the set containing all maximal frequent itemsets, thus specifying immediately all frequent itemsets. We evaluate the performance of the algorithm using well-known synthetic benchmark databases, real-life census, and stock market databases. The improvement in performance can be up to several orders of magnitude, compared to the best previous algorithms.

引用

页码：553 / 566

页数：14

共 32 条

[1]

AGRAWAL R, 2000, J PARALLEL DISTRIBUT

[2]

Agrawal R., 1994, P 20 VER LARG DAT BA

[3]

AGRAWAL R, 1996, IEEE T KNOWLEDGE DAT, V7

[4]

AGRAWAL R, 1996, P 2 ACM SIGKDD INT C

[5]

AGRAWAL R, 1993, P ACM SPEC INT GROUP

[6]

BAYARDO R, 1997, P 3 ACM SIGKKD INT C

[7]

Bayardo R., 1998, P ACM SPEC INT GROUP

[8]

BRIN S, 1997, P ACM SPEC INT GROUP

[9]

FAYYAD UM, 1996, ADV KNOWLEDGE DISCOV

[10]

GUNOPULOS D, 1997, P 16 ACM SIGACT SIGM

← 1 2 3 4 →