Ensembles of label noise filters: a ranking approach

Cited by: 31
Authors
Garcia, Luis P. F. [1]
Lorena, Ana C. [2]
Matwin, Stan [3, 4]
de Carvalho, Andre C. P. L. F. [1]
Affiliations
[1] Univ Sao Paulo, Inst Ciencias Matemat & Comp, Trabalhador Sao Carlense Ave 400, Sao Paulo, Brazil
[2] Univ Fed Sao Paulo, Inst Ciencia & Tecnol, Talim St 330, Sao Paulo, Brazil
[3] Dalhousie Univ, Inst Big Data Analyt, Univ Ave 6050, Halifax, NS, Canada
[4] Polish Acad Sci, Inst Comp Sci, Warsaw, Poland
Funding
São Paulo Research Foundation, Brazil; Natural Sciences and Engineering Research Council of Canada
Keywords
Label noise; Noise filters; Ensemble filters; Noise ranking; Recommendation system; Classification; Performance
DOI
10.1007/s10618-016-0475-9
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Label noise can be a major problem in classification tasks, since most machine learning algorithms rely on data labels in their inductive process. Consequently, various techniques for label noise identification have been investigated in the literature. The bias of each technique defines how suitable it is for each dataset. Moreover, while some techniques identify a large number of examples as noisy and have a high false positive rate, others are very restrictive and therefore not able to identify all noisy examples. This paper investigates how label noise detection can be improved by using an ensemble of noise filtering techniques. These filters, both individual and ensemble, are experimentally compared. Another concern in this paper is the computational cost of ensembles, since, for a particular dataset, an individual technique can have the same predictive performance as an ensemble. In this case the individual technique should be preferred. To deal with this situation, this study also proposes the use of meta-learning to recommend, for a new dataset, the best filter. An extensive experimental evaluation of the use of individual filters, ensemble filters and meta-learning was performed using public datasets with imputed label noise. The results show that ensembles of noise filters can improve noise filtering performance and that a recommendation system based on meta-learning can successfully recommend the best filtering technique for new datasets. A case study using a real dataset from the ecological niche modeling domain is also presented and evaluated, with the results validated by an expert.
Pages: 1192-1216
Number of pages: 25