A Breast Cancer Diagnosis Method Based on VIM Feature Selection and Hierarchical Clustering Random Forest Algorithm

Cited by: 27
Authors
Huang, Zexian [1]
Chen, Daqi [1]
Affiliations
[1] Guangzhou Univ, Sch Mech & Elect Engn, Guangzhou 510006, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Random forests; Decision trees; Breast cancer; Feature extraction; Cancer; Training; Databases; Hierarchical clustering random forest algorithm; Feature selection; Genetic algorithm; Prediction; Prognosis
DOI
10.1109/ACCESS.2021.3139595
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Breast cancer is a neoplastic disease that seriously threatens women's health and is regarded as the most common cause of cancer death in women. Accurate detection and effective treatment are vital to lowering the breast cancer death rate. In recent years, machine learning techniques have been considered effective for the accurate diagnosis of various diseases, and Random Forest (RF) has been widely applied among them. However, decision trees with poor classification performance and high mutual similarity may be generated during training, which degrades the overall classification performance of the model. In this paper, a Hierarchical Clustering Random Forest (HCRF) model is developed. The similarity among all the decision trees is measured, and hierarchical clustering is then used to group the trees. Representative trees are selected from the resulting clusters to construct a hierarchical clustering random forest with low similarity and high accuracy. In addition, we use the Variable Importance Measure (VIM) method to optimize the number of selected features for breast cancer prediction. The Wisconsin Diagnosis Breast Cancer (WDBC) and Wisconsin Breast Cancer (WBC) databases from the UCI (University of California Irvine) Machine Learning Repository are employed in this study. The performance of the proposed method is evaluated using accuracy, precision, sensitivity, specificity and AUC (Area Under the ROC Curve). Experimental results indicate that classification based on the HCRF algorithm with VIM feature selection reaches the best accuracy, 97.05% and 97.76%, compared with Decision Tree, AdaBoost and Random Forest on the WDBC and WBC datasets. The method proposed in this study is an effective tool for diagnosing breast cancer.
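The tree-selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure, the linkage rule, or how representatives are chosen, so the sketch assumes (1) similarity is measured by the disagreement rate between two trees' predictions on a validation set, (2) average-linkage agglomerative clustering, and (3) the most accurate tree in each cluster is kept. All function names and the toy data are hypothetical, and labels are assumed to be binary (0 = benign, 1 = malignant).

```python
from itertools import combinations


def disagreement(p, q):
    """Fraction of samples on which two trees predict different labels."""
    return sum(a != b for a, b in zip(p, q)) / len(p)


def accuracy(pred, y):
    """Fraction of samples predicted correctly."""
    return sum(a == b for a, b in zip(pred, y)) / len(y)


def cluster_trees(preds, n_clusters):
    """Average-linkage agglomerative clustering of trees (assumed linkage)."""
    clusters = [[i] for i in range(len(preds))]
    # Pairwise distances between trees, keyed by (smaller index, larger index).
    dist = {(i, j): disagreement(preds[i], preds[j])
            for i, j in combinations(range(len(preds)), 2)}

    def linkage(a, b):
        total = sum(dist[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] += clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


def select_representatives(preds, y, n_clusters):
    """Keep the most accurate tree of each cluster (assumed selection rule)."""
    clusters = cluster_trees(preds, n_clusters)
    return sorted(max(c, key=lambda i: accuracy(preds[i], y)) for c in clusters)


def majority_vote(preds, chosen):
    """Binary majority vote over the chosen trees; ties go to label 0."""
    n_samples = len(preds[0])
    return [1 if 2 * sum(preds[i][s] for i in chosen) > len(chosen) else 0
            for s in range(n_samples)]


# Toy validation labels and per-tree predictions (hypothetical data).
y_val = [1, 0, 1, 1, 0, 0, 1, 0]
tree_preds = [
    [1, 0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 1, 0],
]

chosen = select_representatives(tree_preds, y_val, n_clusters=3)
ensemble = majority_vote(tree_preds, chosen)
print(chosen, accuracy(ensemble, y_val))
```

In practice the prediction vectors would come from the individual trees of a trained random forest, and the number of clusters would be a tuning parameter. The sketch works on plain lists so that the clustering and selection steps stay visible.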
Pages: 3284-3293
Page count: 10