Effective database transformation and efficient support computation for mining sequential patterns

被引：0

作者：

Cho, Chung-Wen ^{[2
]}

Wu, Yi-Hung ^{[3
]}

Chen, Arbee L. P. ^{[1
]}

机构：

[1] Natl Chengchi Univ, Dept Comp Sci, Taipei, Taiwan

[2] Natl Tsing Hua Univ, Dept Comp Sci, Hsinchu 30043, Taiwan

[3] Chung Yuan Christian Univ, Dept Informat & Comp Engn, Jhongli, Taiwan

来源：

JOURNAL OF INTELLIGENT INFORMATION SYSTEMS | 2009年 / 32卷 / 01期

关键词：

Data mining; Sequential patterns; Database transformation; Support computation; Database projection; ALGORITHM;

D O I：

暂无

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

In this paper, we propose a novel algorithm for mining frequent sequences from transaction databases. The transactions of the same customers form a set of customer sequences. A sequence (an ordered list of itemsets) is frequent if the number of customer sequences containing it satisfies the user-specified threshold. The 1-sequence is a special type of sequences because it consists of only a single itemset instead of an ordered list, while the k-sequence is a sequence composed of k itemsets. Compared with the cost of mining frequent k-sequences (k a parts per thousand yenaEuro parts per thousand 2), the cost of mining frequent 1-sequences is negligible. We adopt a two-phase architecture to find the two types of frequent sequences separately in order that the discovery of frequent k-sequences can be well designed and optimized. For efficient frequent k-sequence mining, every frequent 1-sequence is encoded as a unique symbol and the database is transformed into one constituted by the symbols. We find that it is unnecessary to encode all the frequent 1-seqences, and make full use of the discovered frequent 1-sequences to transform the database into one with a smaller size. For every k a parts per thousand yenaEuro parts per thousand 2, the customer sequences in the transformed database are scanned to find all the frequent k-sequences. We devise the compact representation for a customer sequence and elaborate the method to enumerate all distinct subsequences from a customer sequence without redundant scans. The soundness of the proposed approach is verified and a number of experiments are performed. The results show that our approach outperforms the previous works in both scalability and execution time.

引用

页码：23 / 51

页数：29

共 12 条

[1]

AGRAWAL R, 1995, PROC INT CONF DATA, P3, DOI 10.1109/ICDE.1995.380415

[2]

Ayres J., 2002, P 8 ACM SIGKDD INT C, P429, DOI DOI 10.1145/775047.775109

[3] An efficient algorithm for mining frequent sequences by a new strategy without support counting [J].

Chiu, DY ;

Wu, YH ;

Chen, ALP .

20TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2004, :375-386

[4]

Cho CW, 2005, LECT NOTES COMPUT SC, V3453, P163

[5] Mining sequential patterns by pattern-growth: The PrefixSpan approach [J].

Pei, J ;

Han, JW ;

Mortazavi-Asl, B ;

Wang, JY ;

Pinto, H ;

Chen, QM ;

Dayal, U ;

Hsu, MC .

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2004, 16 (11) :1424-1440

[6]

Srikant R., 1996, Advances in Database Technology - EDBT '96. 5th International Conference on Extending Database Technology. Proceedings, P3

[7]

Sun XZ, 2003, THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, P299

[8]

Tzvetkov P, 2003, THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, P347

[9]

WANG K, 2002, P 6 PAC AR C KNOWL D, P334, DOI DOI 10.1007/3-540-47887-6_34

[10]

Yan XF, 2003, SIAM PROC S, P166

← 1 2 →