Towards an End-to-End Data Quality Optimizer

Cited by: 0
Authors
Restat, Valerie [1]
Klettke, Meike [2]
Stoerl, Uta [1]
Affiliations
[1] Univ Hagen, Chair Databases & Informat Syst, Hagen, Germany
[2] Univ Regensburg, Chair Data Engn, Regensburg, Germany
Source
2024 IEEE 40TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOP, ICDEW | 2024
Keywords
data quality; data cleaning; optimization;
DOI
10.1109/ICDEW61823.2024.00039
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
To provide good results and decisions in data-driven systems, data quality must be ensured as a primary consideration. An important aspect of this is data cleaning. Although many different algorithms and tools already exist for data cleaning, an end-to-end data quality solution is still needed. In this paper, we present our vision of a well-founded end-to-end data quality optimizer. In contrast to many studies that consider data cleaning in the context of machine learning, our approach focuses on various scenarios, such as when preprocessing and downstream analysis are separated. Our proposed adaptive and easily extensible framework operates similarly to proven methods of database query optimization. Analogously, it consists of the following parts: rule-based optimization, where the appropriate data cleaning algorithms are selected based on use case constraints; optimizer hints in the form of best practices; and cost-based optimization, where the cost is measured in terms of data quality. Accordingly, the result is a data cleaning pipeline that provides the best possible data quality. The choice of different optimization goals enables further flexibility, e.g., for environments with limited resources.
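The three parts named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the cleaner functions, the `deletes_rows` flag, the representation of hints as ordering pairs, and the `quality` metric (share of original cells that end up valid) are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable


@dataclass(frozen=True)
class Cleaner:
    """A hypothetical data cleaning step (not taken from the paper)."""
    name: str
    apply: Callable[[list[dict]], list[dict]]
    deletes_rows: bool = False


def drop_missing(rows):
    # Remove every row that contains a missing (None) value.
    return [r for r in rows if None not in r.values()]


def fill_missing_age(rows):
    # Replace a missing age with the rounded mean of the known ages.
    known = [r["age"] for r in rows if r["age"] is not None]
    mean = round(sum(known) / len(known)) if known else None
    return [{**r, "age": mean if r["age"] is None else r["age"]} for r in rows]


def normalize_city(rows):
    # Strip whitespace and title-case the city name.
    return [{**r, "city": None if r["city"] is None else r["city"].strip().title()}
            for r in rows]


def quality(cleaned, original):
    # Assumed metric: valid cells after cleaning / cells in the original data.
    total = sum(len(r) for r in original)
    valid = 0
    for r in cleaned:
        for key, value in r.items():
            if value is None:
                continue
            if key == "city" and value != value.strip().title():
                continue
            valid += 1
    return valid / total if total else 1.0


def rule_based(cleaners, allow_row_deletion):
    # Rule-based stage: discard cleaners that violate a use case constraint.
    return [c for c in cleaners if allow_row_deletion or not c.deletes_rows]


def respects_hints(order, hints):
    # Hints are assumed to be (before, after) pairs of cleaner names.
    names = [c.name for c in order]
    return all(names.index(a) < names.index(b)
               for a, b in hints if a in names and b in names)


def optimize(rows, cleaners, allow_row_deletion, hints):
    # Cost-based stage: enumerate candidate pipelines, keep the best score.
    candidates = rule_based(cleaners, allow_row_deletion)
    best_score, best_pipeline = quality(rows, rows), ()
    for size in range(1, len(candidates) + 1):
        for order in permutations(candidates, size):
            if not respects_hints(order, hints):
                continue
            data = rows
            for cleaner in order:
                data = cleaner.apply(data)
            score = quality(data, rows)
            if score > best_score:
                best_score, best_pipeline = score, order
    return [c.name for c in best_pipeline], best_score


CLEANERS = [
    Cleaner("drop_missing", drop_missing, deletes_rows=True),
    Cleaner("fill_missing_age", fill_missing_age),
    Cleaner("normalize_city", normalize_city),
]
ROWS = [
    {"age": 30, "city": " hagen "},
    {"age": None, "city": "Regensburg"},
    {"age": 40, "city": "HAGEN"},
]

if __name__ == "__main__":
    hints = [("normalize_city", "fill_missing_age")]
    print(optimize(ROWS, CLEANERS, allow_row_deletion=False, hints=hints))
```

Exhaustive enumeration is only workable for a handful of cleaners. A different optimization goal, such as a runtime budget for environments with limited resources, could be modelled in this sketch by swapping the scoring function.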
Pages: 262-266
Number of pages: 5