Ensemble-based noise detection: noise ranking and visual performance evaluation

Cited by: 55
Authors
Sluban, Borut [1, 2]
Gamberger, Dragan [3]
Lavrac, Nada [1, 2]
Affiliations
[1] Jozef Stefan Inst, Ljubljana, Slovenia
[2] Jozef Stefan Int Postgrad Sch, Ljubljana, Slovenia
[3] Rudjer Boskovic Inst, Zagreb, Croatia
Keywords
Noise detection; Ensembles; Noise ranking; Precision-recall evaluation; OUTLIER DETECTION; ATTRIBUTE NOISE; ELIMINATION; ALGORITHMS;
DOI
10.1007/s10618-012-0299-1
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Noise filtering is most frequently used in data preprocessing to improve the accuracy of induced classifiers. The focus of this work is different: we aim at detecting noisy instances for improved data understanding, data cleaning and outlier identification. The paper is composed of three parts. The first part presents an ensemble-based noise ranking methodology for explicit noise and outlier identification, named NoiseRank, which was successfully applied to a real-life medical problem as proven in domain expert evaluation. The second part is concerned with quantitative performance evaluation of noise detection algorithms on data with randomly injected noise. A methodology for visual performance evaluation of noise detection algorithms in the precision-recall space, named Viper, is presented and compared to standard evaluation practice. The third part presents the implementation of the NoiseRank and Viper methodologies in a web-based platform for composition and execution of data mining workflows. This implementation allows public accessibility of the developed approaches, repeatability and sharing of the presented experiments as well as the inclusion of web services enabling to incorporate new noise detection algorithms into the proposed noise detection and performance evaluation workflows.
Pages: 265-303
Page count: 39