A survey on dataset quality in machine learning

Cited by: 105
Authors
Gong, Youdi [1, 2]
Liu, Guangzhen [1 ]
Xue, Yunzhi [1 ]
Li, Rui [1 ]
Meng, Lingzhong [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Software, Beijing 100190, Peoples R China
[2] Beihang Univ, Beijing 100191, Peoples R China
Keywords
Dataset; Dataset quality; Machine learning
DOI
10.1016/j.infsof.2023.107268
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the rise of big data, dataset quality has become a crucial factor in the performance of machine learning models, and high-quality datasets are essential for realizing the value of data. This survey summarizes research on dataset quality in machine learning: it defines the related concepts, analyzes quality issues and risks, and reviews the dataset quality dimensions and metrics reported in the literature from a dataset lifecycle perspective. It also introduces a comprehensive quality evaluation process comprising a framework of dimensions and metrics for dataset quality evaluation, computation methods for the quality metrics, and assessment models. These studies provide guidance for evaluating dataset quality in machine learning, which can help improve the accuracy, efficiency, and generalization ability of machine learning models and promote the development and application of artificial intelligence technology.
Pages: 12