On Studying the Effect of Data Quality on Classification Performances

Cited by: 1
Authors
Jouseau, Roxane [1]
Salva, Sebastien [1]
Samir, Chafik [1]
Affiliations
[1] Univ Clermont Auvergne, Clermont Ferrand, France
Source
INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2022 | 2022 / Vol. 13756
Keywords
Data quality; Data engineering; Data cleaning; Data repairing; Classification; Machine learning
DOI
10.1007/978-3-031-21753-1_9
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
During the last decade, data have played a key role in learning and decision-making models. Unfortunately, data quality has often been ignored or only partially investigated as a pre-processing step. Motivated by applications in various fields, we propose to study data quality and its impact on the performance of several learning models. In this work, we first study the difficulty of repairing errors by introducing a list of elementary repairing tasks of increasing difficulty, ranging from easy to complex. Then, we group the state-of-the-art cleaning and repairing methods into categories. We also investigate whether it is always efficient to repair data. By including standard classification models and public datasets, our work enables their use in different contexts and can be extended to other machine learning applications.
Pages: 82-93
Number of pages: 12