Exploring and cleaning big data with random sample data blocks

Cited by: 15
Authors
Salloum, Salman [1,2]
Huang, Joshua Zhexue [1,2]
He, Yulin [1,2]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
Funding
National Key R&D Program of China; China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Big data; Exploratory data analysis; Statistical estimation; Data cleaning; Block-level sampling; Random sample partition; Distributed; Parallel and cluster computing;
DOI
10.1186/s40537-019-0205-4
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore method, which enables data scientists to explore big data iteratively on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model represents the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data set. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and obtain samples of clean data. We provide a theoretical analysis of using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSP-Explore method. Experimental results on three real data sets show that the approximate results from RSP-Explore converge rapidly toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
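
The block-level estimation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' RSP-Explore implementation: the function names `make_rsp_blocks` and `estimate_mean`, the synthetic Gaussian data, and the in-memory shuffle used as a stand-in for a distributed random sample partition are all assumptions made for illustration.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Hypothetical stand-in for an RSP: shuffle the records, then deal them
    into equal-sized blocks so each block is a random sample of the data."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Stride slicing deals the shuffled records round-robin into the blocks.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def estimate_mean(blocks, num_selected, seed=0):
    """Block-level sampling: pick a few whole blocks at random and average
    their means (valid as written because the blocks have equal size)."""
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    return statistics.fmean(statistics.fmean(block) for block in selected)


# Synthetic data (assumption): 100,000 draws from a Gaussian distribution.
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
true_mean = statistics.fmean(data)
approx_mean = estimate_mean(blocks, num_selected=5)

print(f"true mean:   {true_mean:.3f}")
print(f"approx mean: {approx_mean:.3f} (from 5 of {len(blocks)} blocks)")
```

Only 5 of the 100 blocks are read to produce the estimate. Increasing `num_selected` should tighten it, which is the convergence behaviour the abstract describes.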
Pages: 28