Class noise vs. attribute noise: A quantitative study of their impacts

Cited by: 628
Authors
Zhu, XQ [1]
Wu, XD [1]
Affiliations
[1] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA
Keywords
attribute noise; class noise; machine learning; noise impacts;
DOI
10.1007/s10462-004-0751-8
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Real-world data is never perfect and can often suffer from corruptions (noise) that may impact interpretations of the data, models created from the data and decisions made based on the data. Noise can reduce system performance in terms of classification accuracy, time in building a classifier and the size of the classifier. Accordingly, most existing learning algorithms have integrated various approaches to enhance their learning abilities from noisy environments, but the existence of noise can still introduce serious negative impacts. A more reasonable solution might be to employ some preprocessing mechanisms to handle noisy instances before a learner is formed. Unfortunately, little research has been conducted to systematically explore the impact of noise, especially from the noise handling point of view. This has made various noise processing techniques less significant, specifically when dealing with noise that is introduced in attributes. In this paper, we present a systematic evaluation of the effect of noise in machine learning. Instead of taking any unified theory of noise to evaluate the noise impacts, we differentiate noise into two categories, class noise and attribute noise, and analyze their impacts on the system performance separately. Because class noise has been widely addressed in existing research efforts, we concentrate on attribute noise. We investigate the relationship between attribute noise and classification accuracy, the impact of noise at different attributes, and possible solutions in handling attribute noise. Our conclusions can be used to guide interested readers to enhance data quality by designing various noise handling mechanisms.
Pages: 177-210
Page count: 34