AutoCure: Automated Tabular Data Curation for ML Pipelines

Cited by: 2
Authors
Abdelaal, Mohamed [1]
Koparde, Rashmi [2]
Schoening, Harald [1]
Affiliations
[1] Software AG, Darmstadt, Hessen, Germany
[2] Univ Magdeburg, Magdeburg, Germany
Source
PROCEEDINGS OF THE SIXTH INTERNATIONAL WORKSHOP ON EXPLOITING ARTIFICIAL INTELLIGENCE TECHNIQUES FOR DATA MANAGEMENT, AIDM 2023 | 2023
Keywords
data curation; data quality; data augmentation; machine learning; tabular data
DOI
10.1145/3593078.3593930
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Machine learning algorithms have become increasingly prevalent in multiple domains, such as autonomous driving, healthcare, and finance. In such domains, data preparation remains a significant challenge in developing accurate models: it demands considerable expertise and time to explore the huge search space of suitable data curation and transformation tools. To address this challenge, we present AutoCure, a novel and configuration-free data curation pipeline that improves the quality of tabular data. Unlike traditional data curation methods, AutoCure synthetically enhances the density of the clean data fraction through an adaptive ensemble-based error detection method and a data augmentation module. In practice, AutoCure can be integrated with open-source tools, e.g., Auto-sklearn, H2O, and TPOT, to promote the democratization of machine learning. As a proof of concept, we provide a comparative evaluation of AutoCure against 28 combinations of traditional data curation tools, demonstrating superior performance and predictive accuracy without user intervention. Our evaluation shows that AutoCure is an effective approach to automating data preparation and improving the accuracy of machine learning models.
Pages: 11