Identifying and handling mislabelled instances

Cited by: 113

Authors:

Muhlenbach, F [1]

Lallich, S [1]

Zighed, DA [1]

Affiliations:

[1] Univ Lyon 2, ERIC Lab, F-69676 Bron, France

Source:

JOURNAL OF INTELLIGENT INFORMATION SYSTEMS | 2004 / Vol. 22 / Issue 01

Keywords:

supervised learning; mislabelled data; geometrical neighbourhood; filtering; removing instances; relabelling instances

DOI:

10.1023/A:1025832930864

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Data mining and knowledge discovery aim at producing useful and reliable models from data. Unfortunately, some databases contain noisy data, which perturb the generalization of the models. An important source of noise is mislabelled training instances. We propose a new approach that improves classification accuracy by means of a preliminary filtering procedure. An example is suspect when, in its neighbourhood defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database itself. Such suspect examples in the training data can be removed or relabelled. The filtered training set is then provided as input to learning algorithms. Our experiments on ten benchmarks from the UCI Machine Learning Repository, using 1-NN as the final algorithm, show that removal gives better results than relabelling. Removal maintains the generalization error rate when from 0 to 20% of noise is introduced on the class, especially when the classes are well separable. Finally, the proposed filtering method is compared to the relaxation relabelling schema.
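The filtering rule stated in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the geometrical graph is replaced by a plain k-nearest-neighbour neighbourhood, "not significantly greater" is checked with a one-sided binomial test against the class frequency in the whole training set, and the names `find_suspects`, `remove_suspects`, `relabel_suspects`, `k` and `alpha` are hypothetical. The toy data are invented for illustration only.

```python
from collections import Counter
from math import comb

def upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, i) * p**i * (1 - p)**(trials - i)
               for i in range(successes, trials + 1))

def neighbours(points, i, k):
    """Indices of the k points closest to points[i].
    Assumption: k-NN stands in for the paper's geometrical graph."""
    def dist(j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=dist)[:k]

def find_suspects(points, labels, k=7, alpha=0.1):
    """Flag examples whose share of same-class neighbours is not
    significantly greater than that class's share of the whole set."""
    freq = Counter(labels)
    n = len(labels)
    suspects = []
    for i, y in enumerate(labels):
        nbrs = neighbours(points, i, k)
        same = sum(labels[j] == y for j in nbrs)
        # One-sided test: a large p-value means "not significantly greater".
        if upper_tail(same, len(nbrs), freq[y] / n) >= alpha:
            suspects.append(i)
    return suspects

def remove_suspects(points, labels, suspects):
    """Option 1: drop the suspect examples from the training set."""
    keep = [i for i in range(len(labels)) if i not in set(suspects)]
    return [points[i] for i in keep], [labels[i] for i in keep]

def relabel_suspects(points, labels, suspects, k=7):
    """Option 2: give each suspect the majority class of its neighbours."""
    new_labels = list(labels)
    for i in suspects:
        votes = Counter(labels[j] for j in neighbours(points, i, k))
        new_labels[i] = votes.most_common(1)[0][0]
    return new_labels

# Toy data: two well-separated 1-D clusters plus one mislabelled point
# (x = 4.5 sits inside the class-0 cluster but carries label 1).
points = [(float(x),) for x in range(10)] \
       + [(float(x),) for x in range(20, 30)] + [(4.5,)]
labels = [0] * 10 + [1] * 10 + [1]

suspects = find_suspects(points, labels)
kept_points, kept_labels = remove_suspects(points, labels, suspects)
relabelled = relabel_suspects(points, labels, suspects)
print(suspects)  # → [20]
```

With small neighbourhoods the test has little power, so the result depends strongly on the assumed `k` and `alpha`; in this toy example a stricter threshold such as `alpha=0.01` would also flag correctly labelled points. The filtered set (`kept_points`, `kept_labels`) or the relabelled one would then be passed to the final learner, for instance 1-NN as in the abstract.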

Citation

Pages: 89-109

Page count: 21

