Mining the frequency of time-constrained serial episodes over massive data sequences and streams

Cited by: 2
Authors
Li, Hui [1, 2]
Li, Zhe [1]
Peng, Sizhe [1]
Li, Jingjing [3]
Tungom, Chia Emmanuel [1]
Affiliations
[1] Xidian Univ, Sch Cyber Engn, Xian 710071, Peoples R China
[2] Xidian Univ, State Key Lab Integrated Serv Networks, Xian, Peoples R China
[3] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Hong Kong, Peoples R China
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2020 / Vol. 110
Funding
National Natural Science Foundation of China
Keywords
Spark; Sequence mining; Serial episode; Frequency; Stream; PATTERNS;
DOI
10.1016/j.future.2019.11.008
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
With the popularity and development of the Internet, telecommunications, industrial systems, etc., massive amounts of event sequences and streams have been and are being produced. These sequences and streams are generated at a fast pace, posing grand challenges in computation and analysis. On the one hand, due to the huge number of events, analyzing the sequences is time-consuming. On the other hand, as events in a stream do not necessarily arrive at a uniform speed, an effective computational model over the stream should be able to accommodate the intensive arrival of events. In this work, we focus on frequency evaluation, which is one representative task in sequence and stream analysis. To address the challenges listed above, we present a one-pass algorithm, namely ONCE, which outputs a popularly used frequency from a given sequence. Moreover, we also present a pair of advanced models, SparkONCE and StreamingONCE, both built on ONCE. With a series of non-trivial strategies carefully designed for Spark, SparkONCE and StreamingONCE exhibit superior performance with respect to ONCE. In particular, compared to ONCE, SparkONCE significantly improves efficiency on massive sequences; StreamingONCE can effectively adapt to the uneven arrival speed of events in a stream. The experimental study on real-world and synthetic datasets demonstrates that the proposed approaches work well on massive sequences and streams. (C) 2019 Elsevier B.V. All rights reserved.
Pages: 849-863
Page count: 15