Mining the frequency of time-constrained serial episodes over massive data sequences and streams

Cited by: 2
Authors
Li, Hui [1, 2]
Li, Zhe [1]
Peng, Sizhe [1]
Li, Jingjing [3]
Tungom, Chia Emmanuel [1]
Affiliations
[1] Xidian Univ, Sch Cyber Engn, Xian 710071, Peoples R China
[2] Xidian Univ, State Key Lab Integrated Serv Networks, Xian, Peoples R China
[3] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Hong Kong, Peoples R China
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2020 / Vol. 110
Funding
National Natural Science Foundation of China;
Keywords
Spark; Sequence mining; Serial episode; Frequency; Stream; PATTERNS;
DOI
10.1016/j.future.2019.11.008
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
With the popularity and development of the Internet, telecommunications, industrial systems, and the like, massive amounts of event sequences and streams have been and are being produced. These sequences and streams are generated at a fast pace, posing grand challenges in computation and analysis. On the one hand, because of the huge number of events, analyzing the sequences is time-consuming. On the other hand, because events in a stream do not necessarily arrive at a uniform speed, an effective computational model over the stream should be able to accommodate the intensive arrival of events. In this work, we focus on frequency evaluation, which is one representative task in sequence and stream analysis. To address the challenges listed above, we present a one-pass algorithm, namely ONCE, which computes a widely used frequency measure for a given sequence. We also present a pair of advanced models, SparkONCE and StreamingONCE, both built on ONCE. With a series of non-trivial strategies carefully designed for Spark, SparkONCE and StreamingONCE exhibit superior performance with respect to ONCE. In particular, compared to ONCE, SparkONCE significantly improves efficiency on massive sequences, and StreamingONCE can effectively adapt to the uneven arrival speed of events in a stream. The experimental study on real-world and synthetic datasets demonstrates that the proposed approaches work well on massive sequences and streams. (C) 2019 Elsevier B.V. All rights reserved.
Pages: 849-863
Page count: 15