Human-in-the-Loop Feature Discovery for Tabular Data

Cited by: 0
Authors
Ionescu, Andra [1]
Mouw, Zeger [1]
Aivaloglou, Efthimia [1]
Hai, Rihan [1]
Katsifodimos, Asterios [1]
Affiliations
[1] Delft Univ Technol, Delft, Netherlands
Source
PROCEEDINGS OF THE 33RD ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2024 | 2024
Keywords
Human-in-the-Loop; Feature Discovery; Tabular Data; Data Science; AutoML
DOI
10.1145/3627673.3679211
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In recent years, researchers have developed several methods to automate discovering datasets and augmenting features for training Machine Learning (ML) models. Together with feature selection, these efforts have paved the way towards what is termed the feature discovery process. Data scientists and engineers use automated feature discovery over tabular datasets to add new features from different sources and enrich training data. By surveying data practitioners, we have observed that automated feature discovery approaches do not allow data scientists to use their domain knowledge during the feature discovery process. In addition, automated feature discovery methods can leak private features or introduce biased ones. In this paper, we introduce the first user-driven human-in-the-loop feature discovery method called HILAutoFeat. We demonstrate the capabilities of HILAutoFeat, which effectively combines automated feature discovery with user-driven insights. Our demonstration is centred around two scenarios: (i) an automated feature discovery scenario - HILAutoFeat acts as a steward in a large data lake where the user is unaware of the quality and relevance of the data, and (ii) a scenario where HILAutoFeat and the user work together - the user drives the feature discovery process by adding his domain and business knowledge, while HILAutoFeat performs the intensive computations.
Pages: 5215-5219
Page count: 5
Related papers
15 in total
[1]   Dataset Discovery in Data Lakes [J].
Bogatu, Alex ;
Fernandes, Alvaro A. A. ;
Paton, Norman W. ;
Konstantinou, Nikolaos .
2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020), 2020, :709-720
[2]   ARDA: Automatic Relational Data Augmentation for Machine Learning [J].
Chepurko, Nadiia ;
Marcus, Ryan ;
Zgraggen, Emanuel ;
Castro Fernandez, Raul ;
Kraska, Tim ;
Karger, David .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2020, 13 (09) :1373-1387
[3]  
Erickson Nick, 2020, arXiv
[4]  
Fan Grace, 2023, SIGMOD/PODS '23: Companion of the 2023 International Conference on Management of Data, P69, DOI 10.1145/3555041.3589409
[5]   Aurum: A Data Discovery System [J].
Fernandez, Raul Castro ;
Abedjan, Ziawasch ;
Koko, Famien ;
Yuan, Gina ;
Madden, Sam ;
Stonebraker, Michael .
2018 IEEE 34TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2018, :1001-1012
[6]   Data Lakes: A Survey of Functions and Systems [J].
Hai, Rihan ;
Koutras, Christos ;
Quix, Christoph ;
Jarke, Matthias .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (12) :12571-12590
[7]   Key Insights from a Feature Discovery User Study [J].
Ionescu, Andra ;
Mouw, Zeger ;
Aivaloglou, Efthimia ;
Katsifodimos, Asterios .
WORKSHOP ON HUMAN-IN-THE-LOOP DATA ANALYTICS, HILDA 2024, 2024
[8]  
Ionescu Andra, 2024, 2024 IEEE 40 INT C D, P1861, DOI 10.1109/ICDE60146.2024.00150
[9]   Valentine: Evaluating Matching Techniques for Dataset Discovery [J].
Koutras, Christos ;
Siachamis, George ;
Ionescu, Andra ;
Psarakis, Kyriakos ;
Brons, Jerry ;
Fragkoulis, Marios ;
Lofi, Christoph ;
Bonifati, Angela ;
Katsifodimos, Asterios .
2021 IEEE 37TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2021), 2021, :468-479
[10]   To Join or Not to Join? Thinking Twice about Joins before Feature Selection [J].
Kumar, Arun ;
Naughton, Jeffrey ;
Patel, Jignesh M. ;
Zhu, Xiaojin .
SIGMOD'16: PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2016, :19-34