Online Outlier Exploration Over Large Datasets

Cited by: 15

Authors:

Cao, Lei [1]

Wei, Mingrui [1]

Yang, Di [2]

Rundensteiner, Elke A. [1]

Affiliations:

[1] Worcester Polytech Inst, Worcester, MA 01609 USA

[2] Oracle Corp, Nashua, NH 03062 USA

Source:

KDD'15: PROCEEDINGS OF THE 21ST ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING | 2015

Funding:

U.S. National Science Foundation;

Keywords:

Outlier; Online Exploration; Parameter Setting;

DOI:

10.1145/2783258.2783387

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Traditional outlier detection systems process each individual outlier detection request, instantiated with a particular parameter setting, one at a time. This is not only prohibitively time-consuming for large datasets, but also tedious for analysts as they explore the data to home in on the appropriate parameter setting and desired results. In this work, we present the first online outlier exploration platform, called ONION, that enables analysts to effectively explore anomalies even in large datasets. First, ONION features an innovative interactive anomaly exploration model that offers an "outlier-centric panorama" into big datasets along with rich classes of exploration operations. Second, to achieve this model, ONION employs an online processing framework composed of a one-time offline preprocessing phase followed by an online exploration phase that enables users to interactively explore the data. The preprocessing phase compresses raw big data into a knowledge-rich ONION abstraction that encodes critical interrelationships of outlier candidates so as to support subsequent interactive outlier exploration. For the interactive exploration phase, our ONION framework provides several processing strategies that efficiently support the outlier exploration operations. Our user study with real data confirms the effectiveness of ONION in recognizing "true" outliers. Furthermore, as demonstrated by our extensive experiments with large datasets, ONION supports all exploration operations with millisecond response times.
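The two-phase idea in the abstract (pay once offline, then answer many parameter settings quickly online) can be illustrated with a minimal sketch. This is not the ONION abstraction itself, whose details the abstract does not give. It assumes the common distance-based definition, in which a point is an outlier if fewer than `k` other points lie within radius `r`, and it fixes `k` at preprocessing time. The names `kth_neighbor_distances` and `OutlierIndex` are hypothetical, and the brute-force neighbor search is only suitable for small inputs.

```python
import bisect
import math


def kth_neighbor_distances(points, k):
    """Offline step: distance from each point to its k-th nearest neighbor.

    Brute force, O(n^2 log n); a real system would use a spatial index.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


class OutlierIndex:
    """Hypothetical index: preprocess once, then query any radius r quickly."""

    def __init__(self, points, k):
        self.k = k
        kdist = kth_neighbor_distances(points, k)
        # Point ids ordered by k-th neighbor distance, plus the sorted distances.
        self.order = sorted(range(len(points)), key=lambda i: kdist[i])
        self.sorted_kdist = [kdist[i] for i in self.order]

    def outliers(self, r):
        """Online step: ids of points with fewer than k neighbors within r.

        A point has fewer than k neighbors within r exactly when its k-th
        neighbor distance exceeds r, so one binary search finds the cut.
        """
        start = bisect.bisect_right(self.sorted_kdist, r)
        return sorted(self.order[start:])


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
index = OutlierIndex(points, k=2)
print(index.outliers(2.0))  # only the far-away point, id 4
print(index.outliers(0.5))  # radius too small: every point is an outlier
```

After the one-time preprocessing, each change of `r` costs a binary search plus the size of the answer, which is the kind of trade-off that makes interactive parameter exploration feasible.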

Pages: 89-98

Page count: 10

References (18 in total):

[1] DOLPHIN: An Efficient Algorithm for Mining Distance-Based Outliers in Very Large Datasets [J].

Angiulli, Fabrizio;

Fassetti, Fabio.

ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2009, 3 (01)

[2] Ankerst M., 1999, SIGMOD Record, V28, P49, DOI 10.1145/304181.304187

[3] Bay S.D., 2003, KDD, P29, DOI 10.1145/956750.956758

[4] Bhaduri K., 2011, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, P859, DOI 10.1145/2020408.2020554

[5] LOF: Identifying density-based local outliers [J].