FAHES: A Robust Disguised Missing Values Detector

Cited by: 22
Authors
Qahtan, Abdulhakim A. [1]
Elmagarmid, Ahmed [1]
Fernandez, Raul Castro [2]
Ouzzani, Mourad [1]
Tang, Nan [1]
Affiliations
[1] HBKU, Qatar Comp Res Inst, Doha, Qatar
[2] MIT CSAIL, Cambridge, MA USA
Source
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining | 2018
Keywords
Disguised Missing Value; Syntactic Outliers; Numerical Outliers; Syntactic Patterns
DOI
10.1145/3219819.3220109
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Missing values are common in real-world data and may seriously affect data analytics such as simple statistics and hypothesis testing. Generally speaking, there are two types of missing values: explicitly missing values (i.e., NULL values), and implicitly missing values (a.k.a. disguised missing values (DMVs)) such as "11111111" for a phone number and "Some college" for education. While detecting explicitly missing values is trivial, detecting DMVs is not; the essential challenge is the lack of standardization about how DMVs are generated. In this paper, we present FAHES, a robust system for detecting DMVs from two angles: DMVs as detectable outliers and as detectable inliers. For DMVs as outliers, we propose a syntactic outlier detection module for categorical data, and a density-based outlier detection module for numerical values. For DMVs as inliers, we propose a method that detects DMVs which follow either missing-completely-at-random or missing-at-random models. The robustness of FAHES is achieved through an ensemble technique that is inspired by outlier ensembles. Our extensive experiments using real-world data sets show that FAHES delivers better results than existing solutions.
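To make the "DMVs as syntactic outliers" idea concrete, here is a minimal Python sketch. It is not the FAHES implementation: the character classes (`d`, `a`, `s`), the function names `syntactic_pattern` and `rare_pattern_values`, and the 5% threshold `max_share` are all assumptions chosen for illustration. The sketch generalizes each value to a coarse pattern and flags values whose pattern is rare within the column.

```python
from collections import Counter


def syntactic_pattern(value: str) -> str:
    """Generalize a string: digit -> 'd', letter -> 'a', whitespace -> 's'.

    Any other character (for example '-') is kept as-is.
    These classes are an illustrative assumption, not the paper's scheme.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("a")
        elif ch.isspace():
            out.append("s")
        else:
            out.append(ch)
    return "".join(out)


def rare_pattern_values(column, max_share=0.05):
    """Return the distinct values whose pattern covers at most
    `max_share` of the non-null cells (hypothetical threshold)."""
    cells = [v for v in column if v is not None]
    if not cells:
        return []
    counts = Counter(syntactic_pattern(v) for v in cells)
    total = len(cells)
    flagged = {
        v for v in cells
        if counts[syntactic_pattern(v)] / total <= max_share
    }
    return sorted(flagged)


# Hypothetical phone-number column with two candidate DMVs.
phones = [f"555-{i:04d}" for i in range(40)] + ["11111111", "unknown", None]
candidates = rare_pattern_values(phones)
```

In this toy column, 40 of 42 non-null cells share the pattern `ddd-dddd`, so the two values with other patterns become candidates. A check like this covers only the syntactic-outlier angle; a DMV that looks like a legitimate value (an inlier) would pass unnoticed, which is why the abstract describes separate modules for numerical outliers and for inliers.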
Pages: 2100-2109
Number of pages: 10