Enabling Smart Data: Noise filtering in Big Data classification

Cited by: 89
Authors
Garcia-Gil, Diego [1]
Luengo, Julian [1]
Garcia, Salvador [1]
Herrera, Francisco [1, 2]
Affiliations
[1] Univ Granada, Dept Comp Sci & Artificial Intelligence, E-18071 Granada, Spain
[2] King Abdulaziz Univ, Fac Comp & Informat Technol, Jeddah 21589, Saudi Arabia
Keywords
Big Data; Smart Data; Classification; Class noise; Label noise; CHALLENGES; MAPREDUCE; ENSEMBLE; TRENDS
DOI
10.1016/j.ins.2018.12.002
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In any knowledge discovery process, the value of the extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by the massive growth in the scale of data observed in recent years, follow the same dictate. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances and is known to be a very disruptive feature of data. However, in this Big Data era, the massive growth in the scale of the data poses a challenge to traditional proposals created to tackle noise, as they have difficulty coping with such a large amount of data. New algorithms need to be proposed to treat the noise in Big Data problems, providing high-quality, clean data, also known as Smart Data. In this paper, two Big Data preprocessing approaches to remove noisy examples are proposed: a homogeneous ensemble filter and a heterogeneous ensemble filter, with special emphasis on their scalability and performance traits. The obtained results show that these proposals enable the practitioner to efficiently obtain a Smart Dataset from any Big Data classification problem. © 2018 Elsevier Inc. All rights reserved.
Pages: 135-152
Number of pages: 18