Exploring and cleaning big data with random sample data blocks

Cited by: 15
Authors
Salloum, Salman [1 ,2 ]
Huang, Joshua Zhexue [1 ,2 ]
He, Yulin [1 ,2 ]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
Funding
National Key R&D Program of China; China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
Big data; Exploratory data analysis; Statistical estimation; Data cleaning; Block-level sampling; Random sample partition; Distributed; Parallel and cluster computing;
DOI
10.1186/s40537-019-0205-4
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore method to enable data scientists to iteratively explore big data on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model is used to represent the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and get samples of clean data. We provide a theoretical analysis of using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSP-Explore method. The experimental results on three real data sets show that the approximate results from RSP-Explore converge rapidly toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
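The block-level sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' RSP-Explore implementation: the function names `make_rsp_blocks` and `estimate_mean`, the synthetic Gaussian data, and the shuffle-then-split step used to build the blocks are all assumptions made for illustration. The sketch only shows why a few blocks, each of which is itself a random sample of the whole data, can approximate a statistic of the full data set.

```python
import random
import statistics


def make_rsp_blocks(data, num_blocks, seed=0):
    """Shuffle the records and split them into equal-size blocks.

    Assumption: a global shuffle followed by a sequential split stands in
    for the RSP generation step, so every block is a random sample of the
    entire data. Records left over after the even split are dropped.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    block_size = len(shuffled) // num_blocks
    return [
        shuffled[i * block_size:(i + 1) * block_size]
        for i in range(num_blocks)
    ]


def estimate_mean(blocks, num_selected, seed=0):
    """Estimate the global mean from a block-level sample.

    Whole blocks are selected at random (without replacement); because the
    blocks have equal size, averaging the block means equals the mean of
    the pooled records in the selected blocks.
    """
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    block_means = [statistics.fmean(block) for block in selected]
    return statistics.fmean(block_means)


# Hypothetical data: 100,000 synthetic values with mean 50 and std 10.
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
true_mean = statistics.fmean(data)
approx_mean = estimate_mean(blocks, num_selected=5)

print(f"true mean:   {true_mean:.3f}")
print(f"approx mean: {approx_mean:.3f} (from 5 of {len(blocks)} blocks)")
```

Increasing `num_selected` one block at a time and re-computing the estimate gives a simple way to watch the approximation settle toward the full-data value, in the spirit of the iterative exploration the abstract describes.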
Pages: 28