Sharing-Aware Outlier Analytics over High-Volume Data Streams

Cited by: 22
Authors
Cao, Lei [1]
Wang, Jiayuan [2]
Rundensteiner, Elke A. [2]
Affiliations
[1] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
[2] Worcester Polytech Inst, Worcester, MA 01609 USA
Source
SIGMOD'16: PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2016
Funding
National Science Foundation (USA)
Keywords
Outlier; Stream; Multi-query; QUERIES;
DOI
10.1145/2882903.2882920
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Real-time analytics of anomalous phenomena on streaming data typically relies on processing a large variety of continuous outlier detection requests, each configured with different parameter settings. The processing of such complex outlier analytics workloads is resource-intensive due to the algorithmic complexity of the outlier mining process. In this work, we propose a sharing-aware multi-query execution strategy for outlier detection on data streams, called SOP. A key insight of SOP is to transform the problem of handling a multi-query outlier analytics workload into a single-query skyline computation problem. We prove that the output of the skyline computation process corresponds to the minimal information needed for determining the outlier status of any point in the stream. Based on this new formulation, we design a customized skyline algorithm called K-SKY that leverages the domination relationships among the streaming data points to minimize the number of data points that must be evaluated for supporting multi-query outlier detection. Based on this K-SKY algorithm, our SOP solution achieves minimal utilization of both computational and memory resources for the processing of these complex outlier analytics workloads. Our experimental study demonstrates that SOP consistently outperforms the state-of-the-art solutions by three orders of magnitude in CPU time, while consuming only 5% of their memory footprint, a clear win-win. Furthermore, SOP is shown to scale to large workloads composed of thousands of parameterized queries.
Pages: 527-540
Page count: 14