A 'non-parametric' version of the naive Bayes classifier

Cited by: 108
Authors
Soria, Daniele [1]
Garibaldi, Jonathan M. [1]
Ambrogi, Federico [2]
Biganzoli, Elia M. [2]
Ellis, Ian O. [3, 4]
Affiliations
[1] Univ Nottingham, Sch Comp Sci, Nottingham NG8 1BB, England
[2] Univ Milan, Inst Med Stat & Biometry, I-20133 Milan, Italy
[3] Univ Nottingham Hosp, Sch Mol Med Sci, Nottingham NG7 2UH, England
[4] Univ Nottingham, Queens Med Ctr, Nottingham NG7 2UH, England
Keywords
Supervised learning; Naive Bayes; Logistic regression; Breast cancer; UCI data sets; INCOMPLETE DATA; BREAST;
DOI
10.1016/j.knosys.2011.02.014
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many algorithms have been proposed for the machine learning task of classification. One of the simplest methods, the naive Bayes classifier, has often been found to give good performance despite the fact that its underlying assumptions (of independence and a normal distribution of the variables) are perhaps violated. In previous work, we applied naive Bayes and other standard algorithms to a breast cancer database from Nottingham City Hospital in which the variables are highly non-normal and found that the algorithm performed well when predicting a class that had been derived from the same data. However, when we then applied naive Bayes to predict an alternative clinical variable, it performed much worse than other techniques. This motivated us to propose an alternative method, based on naive Bayes, which removes the requirement for the variables to be normally distributed, but retains the essential structure and other underlying assumptions of the method. We tested our novel algorithm on our breast cancer data and on three UCI datasets which also exhibited strong violations of normality. We found our algorithm outperformed naive Bayes in all four cases and outperformed multinomial logistic regression (MLR) in two cases. We conclude that our method offers a competitive alternative to MLR and naive Bayes when dealing with data sets in which non-normal distributions are observed. (C) 2011 Elsevier B.V. All rights reserved.
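The abstract does not say how the normality requirement is removed. As a rough illustration only, the sketch below shows one common way to do it: keep the naive Bayes structure (class prior times a product of per-feature likelihoods) but estimate each per-feature likelihood with a Gaussian kernel density estimate instead of a fitted normal. The class name `KDENaiveBayes`, the helper `gaussian_kde_logpdf`, and the normal-reference bandwidth rule are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def gaussian_kde_logpdf(train, query, bandwidth):
    """Log of a 1-D Gaussian kernel density estimate built on `train`,
    evaluated at each point of `query`."""
    z = (query[:, None] - train[None, :]) / bandwidth
    log_kernels = -0.5 * z**2 - np.log(bandwidth * np.sqrt(2.0 * np.pi))
    # log-mean-exp over the training points, for numerical stability
    m = log_kernels.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_kernels - m).mean(axis=1, keepdims=True))).ravel()


class KDENaiveBayes:
    """Hypothetical naive Bayes variant: features are still treated as
    conditionally independent given the class, but each per-class,
    per-feature density is a kernel estimate rather than a normal fit."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.log_prior_ = np.log([np.mean(y == c) for c in self.classes_])
        self.samples_ = [X[y == c] for c in self.classes_]
        # Assumed bandwidth choice: normal-reference rule of thumb,
        # floored to avoid a zero bandwidth on constant features.
        self.bw_ = [
            np.maximum(1.06 * s.std(axis=0) * len(s) ** (-0.2), 1e-3)
            for s in self.samples_
        ]
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        scores = np.empty((len(X), len(self.classes_)))
        for k, (s, bw) in enumerate(zip(self.samples_, self.bw_)):
            # Independence assumption: sum the per-feature log-likelihoods.
            log_lik = sum(
                gaussian_kde_logpdf(s[:, j], X[:, j], bw[j])
                for j in range(X.shape[1])
            )
            scores[:, k] = self.log_prior_[k] + log_lik
        return self.classes_[scores.argmax(axis=1)]
```

A kernel estimate can follow skewed or multi-modal feature distributions that a single normal curve cannot, which is the kind of situation the abstract describes. The cost is that prediction time grows with the number of training points per class.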
Pages: 775-784
Page count: 10