Mining with noise knowledge: Error-aware data mining

Cited by: 74
Authors
Wu, Xindong [1, 2]
Zhu, Xingquan [3]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
[2] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA
[3] Florida Atlantic Univ, Dept Comp Sci & Engn, Boca Raton, FL 33431 USA
Source
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART A-SYSTEMS AND HUMANS | 2008, Vol. 38, No. 4
Funding
National Natural Science Foundation of China;
Keywords
classification; data mining; naive Bayes (NB); noise handling; noise knowledge;
DOI
10.1109/TSMCA.2008.923034
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Real-world data mining deals with noisy information sources where data collection inaccuracy, device limitations, data transmission and discretization errors, or man-made perturbations frequently result in imprecise or vague data. Two common practices are to adopt either data cleansing approaches to enhance the data consistency or simply take noisy data as quality sources and feed them into the data mining algorithms. Either way may substantially sacrifice the mining performance. In this paper, we consider an error-aware (EA) data mining design, which takes advantage of statistical error information (such as noise level and noise distribution) to improve data mining results. We assume that such noise knowledge is available in advance, and we propose a solution to incorporate it into the mining process. More specifically, we use noise knowledge to restore original data distributions, which are further used to rectify the model built from noise-corrupted data. We materialize this concept by the proposed EA naive Bayes classification algorithm. Experimental comparisons on real-world datasets will demonstrate the effectiveness of this design.
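The idea of using noise knowledge to restore an original data distribution can be illustrated with a minimal sketch. It is not the paper's EA naive Bayes algorithm. It assumes one particular noise model that the abstract does not specify: each value of a k-valued attribute is replaced, with known probability `noise_level`, by one of the other k - 1 values chosen uniformly. The function names `corrupt_distribution` and `restore_distribution` are hypothetical and used only for illustration.

```python
def corrupt_distribution(true_probs, noise_level):
    """Expected observed distribution under the assumed uniform-flip noise model.

    With probability noise_level a value is replaced by one of the other
    k - 1 values, chosen uniformly (an assumption of this sketch).
    """
    k = len(true_probs)
    if k < 2:
        raise ValueError("need at least two attribute values")
    flip = noise_level / (k - 1)  # chance of flipping into one specific other value
    return [(1.0 - noise_level) * t + flip * (1.0 - t) for t in true_probs]


def restore_distribution(observed_probs, noise_level):
    """Estimate the clean distribution by inverting the assumed noise model."""
    k = len(observed_probs)
    if k < 2:
        raise ValueError("need at least two attribute values")
    flip = noise_level / (k - 1)
    denom = 1.0 - noise_level - flip
    if abs(denom) < 1e-12:
        # At noise_level = (k - 1) / k the observed data carry no information.
        raise ValueError("noise level makes the model non-invertible")
    # Sampling error can push estimates below zero, so clip and renormalize.
    estimates = [max((q - flip) / denom, 0.0) for q in observed_probs]
    total = sum(estimates)
    if total == 0.0:
        raise ValueError("all estimates clipped to zero")
    return [e / total for e in estimates]


clean = [0.7, 0.2, 0.1]
noisy = corrupt_distribution(clean, noise_level=0.2)
recovered = restore_distribution(noisy, noise_level=0.2)
```

In a naive Bayes setting, estimates restored this way could stand in for the class-conditional attribute probabilities computed from the noisy training data, which is the kind of model rectification the abstract describes. Whether the paper proceeds in exactly this manner is not stated in this record.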
Pages: 917-932
Number of pages: 16