AutoCure: Automated Tabular Data Curation for ML Pipelines

Cited by: 2
Authors
Abdelaal, Mohamed [1]
Koparde, Rashmi [2]
Schoening, Harald [1]
Affiliations
[1] Software AG, Darmstadt, Hessen, Germany
[2] Univ Magdeburg, Magdeburg, Germany
Source
PROCEEDINGS OF THE SIXTH INTERNATIONAL WORKSHOP ON EXPLOITING ARTIFICIAL INTELLIGENCE TECHNIQUES FOR DATA MANAGEMENT, AIDM 2023 | 2023
Keywords
data curation; data quality; data augmentation; machine learning; tabular data
DOI
10.1145/3593078.3593930
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Machine learning algorithms have become increasingly prevalent in multiple domains, such as autonomous driving, healthcare, and finance. In such domains, data preparation remains a significant challenge in developing accurate models, requiring significant expertise and time investment to explore the huge search space of well-suited data curation and transformation tools. To address this challenge, we present AutoCure, a novel and configuration-free data curation pipeline that improves the quality of tabular data. Unlike traditional data curation methods, AutoCure synthetically enhances the density of the clean data fraction through an adaptive ensemble-based error detection method and a data augmentation module. In practice, AutoCure can be integrated with open source tools, e.g., Auto-sklearn, H2O, and TPOT, to promote the democratization of machine learning. As a proof of concept, we provide a comparative evaluation of AutoCure against 28 combinations of traditional data curation tools, demonstrating superior performance and predictive accuracy without user intervention. Our evaluation shows that AutoCure is an effective approach to automating data preparation and improving the accuracy of machine learning models.
Pages: 11