A Subsequence Interleaving Model for Sequential Pattern Mining

被引:33
作者
Fowkes, Jaroslav [1 ]
Sutton, Charles [1 ]
机构
[1] Univ Edinburgh, Sch Informat, Edinburgh EH8 9AB, Midlothian, Scotland
来源
KDD'16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING | 2016年
基金
英国工程与自然科学研究理事会;
关键词
D O I
10.1145/2939672.2939787
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Recent sequential pattern mining methods have used the minimum description length (MDL) principle to define an encoding scheme which describes an algorithm for mining the most compressing patterns in a database. We present a novel subsequence interleaving model based on a probabilistic model of the sequence database, which allows us to search for the most compressing set of patterns without designing a specific encoding scheme. Our proposed algorithm is able to efficiently mine the most relevant sequential patterns and rank them using an associated measure of interestingness. The efficient inference in our model is a direct result of our use of a structural expectation-maximization framework, in which the expectation-step takes the form of a submodular optimization problem subject to a coverage constraint. We show on both synthetic and real world datasets that our model mines a set of sequential patterns with low spuriousness and redundancy, high interpretability and usefulness in real-world applications. Furthermore, we demonstrate that the quality of the patterns from our approach is comparable to, if not better than, existing state of the art sequential pattern mining algorithms.
引用
收藏
页码:835 / 844
页数:10
相关论文
共 30 条
[1]  
AGRAWAL R, 1995, PROC INT CONF DATA, P3, DOI 10.1109/ICDE.1995.380415
[2]  
[Anonymous], 2014, FREQUENT PATTERN MIN, DOI DOI 10.1007/978-3-319-07821-2
[3]  
Ayres J., 2002, P ACM SIGKDD INT C K, P429
[4]   LIBSVM: A Library for Support Vector Machines [J].
Chang, Chih-Chung ;
Lin, Chih-Jen .
ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2011, 2 (03)
[5]  
Chvatal V., 1979, Mathematics of Operations Research, V4, P233, DOI 10.1287/moor.4.3.233
[6]   MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM [J].
DEMPSTER, AP ;
LAIRD, NM ;
RUBIN, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38
[7]   A threshold of in n for approximating set cover [J].
Feige, U .
JOURNAL OF THE ACM, 1998, 45 (04) :634-652
[8]  
Gwadera R, 2005, SIAM PROC S, P404
[9]  
Jay N., 2004, MEDINFO, p, P1667
[10]  
Kohavi R., 2000, SIGKDD EXPLORATIONS, V2, P86, DOI [DOI 10.1145/380995.381033, 10.1145/380995.381033]