Big Data Pre-Processing: A Quality Framework

Cited by: 65
Authors
Taleb, Ikbal [1]
Dssouli, Rachida [1]
Serhani, Mohamed Adel [2]
Affiliations
[1] Concordia Univ, CIISE, Montreal, PQ, Canada
[2] UAE Univ, Coll Informat Technol, Al Ain, U Arab Emirates
Source
2015 IEEE INTERNATIONAL CONGRESS ON BIG DATA - BIGDATA CONGRESS 2015 | 2015
Keywords
Big Data; Data Quality; pre-processing; DATA PROVENANCE; CHALLENGES; MANAGEMENT; ANALYTICS;
DOI
10.1109/BigDataCongress.2015.35
Chinese Library Classification number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
With the abundance of raw data generated from various sources, Big Data has become a preeminent approach to acquiring, processing, and analyzing large amounts of heterogeneous data to derive valuable evidence. The size, speed, and formats in which data is generated and processed affect the overall quality of information. Therefore, Quality of Big Data (QBD) has become an important factor in ensuring that data quality is maintained at all Big Data processing phases. This paper addresses QBD at the pre-processing phase, which includes sub-processes such as cleansing, integration, filtering, and normalization. We propose a QBD model incorporating processes that support data quality profile selection and adaptation. In addition, the model tracks and records in a data provenance repository the effect of every data transformation that occurs in the pre-processing phase. We evaluate the data quality selection module using a large EEG dataset. The obtained results illustrate the importance of addressing QBD at an early phase of the Big Data processing lifecycle, since doing so significantly reduces costs and enables accurate data analysis.
Pages: 191-198
Page count: 8
Related papers
36 entries in total
  • [1] Towards a Semantic Extract-Transform-Load (ETL) framework for Big Data Integration
    Bansal, Srividya K.
    [J]. 2014 IEEE INTERNATIONAL CONGRESS ON BIG DATA (BIGDATA CONGRESS), 2014: 521 - 528
  • [2] Methodologies for Data Quality Assessment and Improvement
    Batini, Carlo
    Cappiello, Cinzia
    Francalanci, Chiara
    Maurino, Andrea
    [J]. ACM COMPUTING SURVEYS, 2009, 41 (03)
  • [3] Cappiello C., 2013, APPROACH DESIGN BUSI
  • [4] Milieu: Lightweight and Configurable Big Data Provenance for Science
    Cheah, You-Wei
    Canon, Richard
    Plale, Beth
    Ramakrishnan, Lavanya
    [J]. 2013 IEEE INTERNATIONAL CONGRESS ON BIG DATA, 2013: 46 - 53
  • [5] Data-intensive applications, challenges, techniques and technologies: A survey on Big Data
    Chen, C. L. Philip
    Zhang, Chun-Yang
    [J]. INFORMATION SCIENCES, 2014, 275 : 314 - 347
  • [6] Big Data: A Survey
    Chen, Min
    Mao, Shiwen
    Liu, Yunhao
    [J]. MOBILE NETWORKS & APPLICATIONS, 2014, 19 (02) : 171 - 209
  • [7] Discovering Data Quality Rules
    Chiang, Fei
    Miller, Renee J.
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2008, 1 (01): 1166 - 1177
  • [8] Dong XL, 2013, PROC INT CONF DATA, P1245, DOI 10.1109/ICDE.2013.6544914
  • [9] NADEEF: A Generalized Data Cleaning System
    Ebaid, Amr
    Elmagarmid, Ahmed
    Ilyas, Ihab F.
    Ouzzani, Mourad
    Quiane-Ruiz, Jorge-Arnulfo
    Tang, Nan
    Yin, Si
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2013, 6 (12): 1218 - 1221
  • [10] Determining the Currency of Data
    Fan, Wenfei
    Geerts, Floris
    Wijsen, Jef
    [J]. ACM TRANSACTIONS ON DATABASE SYSTEMS, 2012, 37 (04)