A Survey on Classifying Big Data with Label Noise

Cited by: 15
Authors
Johnson, Justin M. [1]
Khoshgoftaar, Taghi M. [1]
Affiliations
[1] Florida Atlantic Univ, POB 1212, Boca Raton, FL 33431 USA
Source
ACM JOURNAL OF DATA AND INFORMATION QUALITY | 2022 / Vol. 14 / No. 4
Keywords
Label noise; data quality; big data; machine learning; classification; deep learning; data streams; CLASSIFICATION; ROBUST;
DOI
10.1145/3492546
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an extensive literature review on treating label noise within big data. We begin with an introduction to the class label noise problem and traditional methods for treating label noise. Next, we present 30 methods for treating class label noise in a range of big data contexts, i.e., high-volume, high-variety, and high-velocity problems. The surveyed works include distributed solutions capable of operating on datasets of arbitrary sizes, deep learning techniques for large-scale datasets with limited clean labels, and streaming techniques for detecting class noise in the presence of concept drift. Common trends and best practices are identified in each of these areas, implementation details are reviewed, empirical results are compared across studies when applicable, and references to 17 open-source projects and programming packages are provided. An emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes this work as a unique contribution that will inspire future research and guide machine learning practitioners.
Pages: 43