On Combining Feature Selection and Over-Sampling Techniques for Breast Cancer Prediction

Cited by: 11
Authors
Huang, Min-Wei [1 ,2 ,3 ]
Chiu, Chien-Hung [4 ]
Tsai, Chih-Fong [5 ]
Lin, Wei-Chao [4 ,6 ]
Affiliations
[1] China Med Univ, Dept Phys Therapy, Taichung 406040, Taiwan
[2] China Med Univ, Grad Inst Rehabil Sci, Taichung 406040, Taiwan
[3] Taichung Vet Gen Hosp, Dept Psychiat, Chiayi Branch, Chiayi 60090, Taiwan
[4] Chang Gung Mem Hosp, Dept Thorac Surg, Linkou 333423, Taiwan
[5] Natl Cent Univ, Dept Informat Management, Taoyuan 320317, Taiwan
[6] Chang Gung Univ, Dept Informat Management, Taoyuan 33302, Taiwan
Source
APPLIED SCIENCES-BASEL | 2021, Vol. 11, Issue 14
Keywords
breast cancer; data mining; machine learning; feature selection; over-sampling; class imbalance; support vector machine; classification; diagnosis; network
DOI
10.3390/app11146574
CLC number
O6 [Chemistry]
Discipline code
0703
Abstract
Breast cancer prediction datasets are usually class imbalanced: the numbers of samples in the malignant and benign patient classes differ significantly. Over-sampling techniques can be used to re-balance such datasets so that more effective prediction models can be constructed. Moreover, some related studies have applied feature selection to remove irrelevant features from the datasets for further performance improvement. However, since the order in which feature selection and over-sampling are combined yields different training sets for constructing the prediction model, it is unknown which order performs better. In this paper, the information gain (IG) and genetic algorithm (GA) feature selection methods and the synthetic minority over-sampling technique (SMOTE) are combined in different orders. Experimental results on two breast cancer datasets show that, for highly class-imbalanced datasets, combining feature selection with over-sampling outperforms using either feature selection or over-sampling alone. In particular, performing IG first and SMOTE second is the better choice. For other datasets with a small class imbalance ratio and fewer features, applying SMOTE alone is enough to construct an effective prediction model.
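The pipeline order the abstract favors (feature selection first, over-sampling second) can be sketched with the two building blocks involved. The following is a minimal, self-contained illustration, not the paper's actual implementation: the toy data, function names, and parameter choices are assumptions, and real experiments would use library implementations (e.g., scikit-learn for IG-style ranking and imbalanced-learn for SMOTE).

```python
import math
import random

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(column, labels):
    """IG of one discrete feature column with respect to the labels:
    H(labels) minus the conditional entropy given the feature value."""
    n = len(labels)
    cond = 0.0
    for v in set(column):
        subset = [y for x, y in zip(column, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def smote(minority, n_new, k=2, rng=random):
    """Toy SMOTE: each synthetic point interpolates between a random
    minority sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic

if __name__ == "__main__":
    # Toy data: rows are samples, columns are discrete features.
    X = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 0, 0), (0, 1, 0)]
    y = [1, 1, 0, 0, 0, 0]  # class 1 is the minority

    # Step 1 (feature selection first): rank features by information gain.
    gains = [information_gain([row[j] for row in X], y) for j in range(3)]
    keep = sorted(range(3), key=lambda j: -gains[j])[:2]
    X_sel = [tuple(float(row[j]) for j in keep) for row in X]

    # Step 2 (over-sampling second): SMOTE the reduced minority class.
    minority = [x for x, label in zip(X_sel, y) if label == 1]
    X_bal = X_sel + smote(minority, n_new=2)
    print(len(X_bal))  # 8 samples after re-balancing
```

Reversing the two steps (SMOTE first, IG second) would rank features on a training set that already contains synthetic points, which is exactly the ordering difference the paper evaluates.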
Pages: 9
References
29 items in total
[1] Alickovic, E.; Subasi, A. Breast cancer diagnosis using GA feature selection and Rotation Forest. Neural Computing & Applications, 2017, 28(4): 753-763.
[2] Aydiner, A. Breast Cancer: A Guide to Clinical Practice. 2019.
[3] Balogun, A. O.; Basri, S.; Abdulkadir, S. J.; Hashim, A. S. Performance Analysis of Feature Selection Methods in Software Defect Prediction: A Search Method Approach. Applied Sciences-Basel, 2019, 9(13).
[4] Cai, T. Applied and Computational Mathematics, 2018, 7: 146. DOI: 10.11648/j.acm.20180703.20.
[5] Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Computers & Electrical Engineering, 2014, 40(1): 16-28.
[6] Dash, M. Intelligent Data Analysis, 1997, 1.
[7] Davis, J. J.; Clark, A. J. Data preprocessing for anomaly based network intrusion detection: A review. Computers & Security, 2011, 30(6-7): 353-375.
[8] Fernandez, A.; Garcia, S.; Herrera, F.; Chawla, N. V. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. Journal of Artificial Intelligence Research, 2018, 61: 863-905.
[9] Fotouhi, S.; Asadi, S.; Kattan, M. W. A comprehensive data level analysis for cancer diagnosis on imbalanced data. Journal of Biomedical Informatics, 2019, 90.
[10] Galar, M.; Fernandez, A.; Barrenechea, E.; Bustince, H.; Herrera, F. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2012, 42(4): 463-484.