Effective database transformation and efficient support computation for mining sequential patterns

被引:0
作者
Cho, Chung-Wen [2 ]
Wu, Yi-Hung [3 ]
Chen, Arbee L. P. [1 ]
机构
[1] Natl Chengchi Univ, Dept Comp Sci, Taipei, Taiwan
[2] Natl Tsing Hua Univ, Dept Comp Sci, Hsinchu 30043, Taiwan
[3] Chung Yuan Christian Univ, Dept Informat & Comp Engn, Jhongli, Taiwan
关键词
Data mining; Sequential patterns; Database transformation; Support computation; Database projection; ALGORITHM;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
In this paper, we propose a novel algorithm for mining frequent sequences from transaction databases. The transactions of the same customers form a set of customer sequences. A sequence (an ordered list of itemsets) is frequent if the number of customer sequences containing it satisfies the user-specified threshold. The 1-sequence is a special type of sequences because it consists of only a single itemset instead of an ordered list, while the k-sequence is a sequence composed of k itemsets. Compared with the cost of mining frequent k-sequences (k a parts per thousand yenaEuro parts per thousand 2), the cost of mining frequent 1-sequences is negligible. We adopt a two-phase architecture to find the two types of frequent sequences separately in order that the discovery of frequent k-sequences can be well designed and optimized. For efficient frequent k-sequence mining, every frequent 1-sequence is encoded as a unique symbol and the database is transformed into one constituted by the symbols. We find that it is unnecessary to encode all the frequent 1-seqences, and make full use of the discovered frequent 1-sequences to transform the database into one with a smaller size. For every k a parts per thousand yenaEuro parts per thousand 2, the customer sequences in the transformed database are scanned to find all the frequent k-sequences. We devise the compact representation for a customer sequence and elaborate the method to enumerate all distinct subsequences from a customer sequence without redundant scans. The soundness of the proposed approach is verified and a number of experiments are performed. The results show that our approach outperforms the previous works in both scalability and execution time.
引用
收藏
页码:23 / 51
页数:29
相关论文
共 12 条
[1]  
AGRAWAL R, 1995, PROC INT CONF DATA, P3, DOI 10.1109/ICDE.1995.380415
[2]  
Ayres J., 2002, P 8 ACM SIGKDD INT C, P429, DOI DOI 10.1145/775047.775109
[3]   An efficient algorithm for mining frequent sequences by a new strategy without support counting [J].
Chiu, DY ;
Wu, YH ;
Chen, ALP .
20TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2004, :375-386
[4]  
Cho CW, 2005, LECT NOTES COMPUT SC, V3453, P163
[5]   Mining sequential patterns by pattern-growth: The PrefixSpan approach [J].
Pei, J ;
Han, JW ;
Mortazavi-Asl, B ;
Wang, JY ;
Pinto, H ;
Chen, QM ;
Dayal, U ;
Hsu, MC .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2004, 16 (11) :1424-1440
[6]  
Srikant R., 1996, Advances in Database Technology - EDBT '96. 5th International Conference on Extending Database Technology. Proceedings, P3
[7]  
Sun XZ, 2003, THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, P299
[8]  
Tzvetkov P, 2003, THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, P347
[9]  
WANG K, 2002, P 6 PAC AR C KNOWL D, P334, DOI DOI 10.1007/3-540-47887-6_34
[10]  
Yan XF, 2003, SIAM PROC S, P166