Sequential Pattern Mining with Wildcards

Cited by: 9
Authors
Xie, Fei [1, 3]
Wu, Xindong [1, 2]
Hu, Xuegang [1 ]
Gao, Jun [1 ]
Guo, Dan [1 ]
Fei, Yulian [4 ]
Hua, Ertian [4 ]
Affiliations
[1] Hefei Univ Tech, Coll Comp Sci & Info Eng, Hefei, Peoples R China
[2] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA
[3] Hefei Normal Univ, Dept Comp Sci & Technol, Hefei, Peoples R China
[4] Zhejiang Gongshang Univ, Coll Comp Sci & Info Eng, Hangzhou, Peoples R China
Source
22ND INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2010), PROCEEDINGS, VOL 1 | 2010
Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
component; sequential pattern mining; wildcard; candidate occurrence pruning; one-off condition;
DOI
10.1109/ICTAI.2010.42
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Sequential pattern mining is an important research task in many domains, such as biological science. In this paper, we study the problem of mining frequent patterns from sequences with wildcards, where the user can specify the gap constraints flexibly. Given a subject sequence, a minimum support threshold, and a gap constraint, we aim to find the frequent patterns whose supports in the sequence are no less than the given threshold. We design an efficient mining algorithm, MAIL, that uses the candidate occurrences of a pattern's prefix to compute the pattern's support, which avoids rescanning the sequence. We present two pruning strategies to improve the completeness and the time efficiency of MAIL. Experiments show that MAIL mines 2 times more patterns than one of its peers and runs 12 times faster on average than another.
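The problem setting described in the abstract can be illustrated with a minimal sketch. This is not the MAIL algorithm and does not reproduce its prefix-based candidate occurrences or pruning strategies. It only shows one plausible reading of support counting under a gap constraint and the one-off condition. The function name `one_off_occurrences`, the parameters `min_gap` and `max_gap`, and the greedy leftmost-first matching are all assumptions made for illustration. The sketch assumes the gap constraint bounds the number of wildcard positions between consecutive pattern characters, and that the one-off condition means each sequence position is used in at most one occurrence.

```python
def one_off_occurrences(sequence, pattern, min_gap, max_gap):
    """Return occurrences of `pattern` in `sequence` as lists of positions.

    Assumptions (illustrative, not taken from the paper):
    - between consecutive pattern characters there are between
      min_gap and max_gap wildcard positions;
    - one-off condition: a sequence position is used at most once
      across all returned occurrences;
    - occurrences are picked greedily, leftmost first, which does not
      guarantee the maximum possible number of occurrences.
    """
    if not pattern:
        return []

    used = [False] * len(sequence)

    def extend(pos, k, chosen):
        # Try to match pattern[k:] to the right of position `pos`.
        if k == len(pattern):
            return chosen
        lo = pos + 1 + min_gap
        hi = min(pos + 1 + max_gap, len(sequence) - 1)
        for nxt in range(lo, hi + 1):
            if not used[nxt] and sequence[nxt] == pattern[k]:
                result = extend(nxt, k + 1, chosen + [nxt])
                if result is not None:
                    return result
        return None

    occurrences = []
    for start in range(len(sequence)):
        if used[start] or sequence[start] != pattern[0]:
            continue
        occ = extend(start, 1, [start])
        if occ is not None:
            for i in occ:
                used[i] = True  # enforce the one-off condition
            occurrences.append(occ)
    return occurrences


def is_frequent(sequence, pattern, min_gap, max_gap, min_support):
    # A pattern is frequent when its support reaches the threshold.
    support = len(one_off_occurrences(sequence, pattern, min_gap, max_gap))
    return support >= min_support
```

For example, in the sequence `AAC` with pattern `AC` and a gap of 0 to 1 wildcards, both `A` positions could pair with the single `C`, but under the one-off reading used here only one occurrence is counted because the `C` cannot be reused.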
Pages: 7