Cervical Cancer Diagnosis Using Random Forest Classifier With SMOTE and Feature Reduction Techniques

Cited by: 92
Authors
Abdoh, Sherif F. [1]
Rizka, Mohamed Abo [1]
Maghraby, Fahima A. [1]
Affiliation
[1] Arab Acad Sci Technol & Maritime Transport, Dept Comp Sci, Cairo 1029, Egypt
Keywords
Cervical cancer; random forest; risk factors; SMOTE
DOI
10.1109/ACCESS.2018.2874063
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Cervical cancer is the fourth most common malignant disease in women worldwide. In most cases, its symptoms are not noticeable in the early stages. Many factors increase the risk of developing cervical cancer, such as human papillomavirus (HPV), sexually transmitted diseases (STDs), and smoking. Identifying those factors and building a model that classifies whether a case is cervical cancer or not is a challenging research problem. This study uses cervical cancer risk factors to build a classification model based on the Random Forest (RF) technique, combined with the Synthetic Minority Oversampling Technique (SMOTE) and two feature reduction techniques: Recursive Feature Elimination (RFE) and Principal Component Analysis (PCA). Medical datasets are often imbalanced because the number of patients is much smaller than the number of non-patients. Because the dataset used here is imbalanced, SMOTE is applied to address this problem. The dataset consists of 32 risk factors and 4 target variables: Hinselmann, Schiller, Cytology, and Biopsy. After comparing the results, we find that combining the random forest classification technique with SMOTE improves classification performance.
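The oversampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it is a plain-Python version of the basic SMOTE idea (interpolating between a minority sample and one of its k nearest minority neighbours), and the function name `smote`, the toy data, and the parameter values are all assumptions made for illustration. In practice a library implementation would normally be used, followed by the classifier and feature reduction steps.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Hypothetical helper for illustration only: picks a random minority
    sample, picks one of its k nearest minority neighbours, and creates a
    new point at a random position on the segment joining the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Indices of the k nearest minority neighbours of x (excluding x),
        # ranked by Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(x, minority[j]),
        )[:k]
        nb = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy, made-up data: 4 minority (positive) cases versus 10 majority cases.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
majority = [[5.0 + 0.1 * i, 5.0 - 0.1 * i] for i in range(10)]

# Generate enough synthetic minority samples to balance the two classes.
new_points = smote(minority, n_new=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
```

Oversampling of this kind would normally be applied only to the training split, so that synthetic points do not leak into the data used for evaluation.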
Pages: 59475-59485
Number of pages: 11