A Benchmark for Data Imputation Methods

Cited by: 83
Authors
Jaeger, Sebastian [1]
Allhorn, Arndt [1]
Biessmann, Felix [1]
Affiliations
[1] Beuth Univ Appl Sci, Berlin, Germany
Source
FRONTIERS IN BIG DATA | 2021, Vol. 4
Keywords
data quality; data cleaning; imputation; missing data; benchmark; MCAR; MNAR; MAR; errors
DOI
10.3389/fdata.2021.693674
Chinese Library Classification
TP [automation technology; computer technology]
Subject Classification Code
0812
Abstract
With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications, too, high data quality standards are crucial to ensure robust predictive performance and responsible use of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and, when left undetected, can have a devastating impact on downstream ML applications. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are still scarce. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only the test data or both the training and test data are affected by missing values. Each imputation method is evaluated with respect to both imputation quality and the impact imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope they help researchers and engineers guide their selection of data preprocessing methods for automated data quality improvement.
Pages: 16
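To make the benchmark setup described in the abstract concrete, below is a minimal sketch of one such experiment: inject missing values completely at random (MCAR) into a tabular dataset, impute with several classical scikit-learn methods, and score both imputation quality (RMSE on the masked cells) and downstream impact (cross-validated classifier accuracy). The dataset, imputers, missingness rate, and metrics are illustrative assumptions, not the authors' actual benchmark protocol or code.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)

# MCAR: each cell is dropped with the same probability,
# independent of observed or unobserved values.
mask = rng.random(X.shape) < 0.2
X_missing = X.copy()
X_missing[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

for name, imputer in imputers.items():
    X_imputed = imputer.fit_transform(X_missing)
    # Imputation quality: RMSE on the cells that were masked
    # (features are left unscaled here for simplicity).
    rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))
    # Downstream impact: accuracy of a classifier trained on the imputed data.
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    acc = cross_val_score(clf, X_imputed, y, cv=5).mean()
    print(f"{name:10s}  imputation RMSE={rmse:8.3f}  downstream acc={acc:.3f}")

The paper's actual benchmark additionally covers MAR and MNAR missingness, heterogeneous and categorical data, deep learning imputers, and the train-versus-test corruption scenarios; this sketch only illustrates the overall evaluation structure.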