CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests

被引:177
作者
Ma, Li [1 ]
Fan, Suohai [1 ]
机构
[1] Jinan Univ, Sch Informat Sci & Technol, Guangzhou 510632, Guangdong, Peoples R China
来源
BMC BIOINFORMATICS | 2017年 / 18卷
基金
中国国家自然科学基金;
关键词
Random forests; Imbalance data; Intelligence algorithm; Feature selection; Parameter optimization; IMBALANCED DATA; AMINO-ACIDS; CLASSIFICATION; PREDICTION; ENSEMBLE; CANCER; BREAST;
D O I
10.1186/s12859-017-1578-z
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Background: The random forests algorithm is a type of classifier with prominent universality, a wide application range, and robustness for avoiding overfitting. But there are still some drawbacks to random forests. Therefore, to improve the performance of random forests, this paper seeks to improve imbalanced data processing, feature selection and parameter optimization. Results: We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data reveal that the combination of Clustering Using Representatives (CURE) enhances the original synthetic minority oversampling technique (SMOTE) algorithms effectively compared with the classification results on the original data using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, the hybrid RF (random forests) algorithm has been proposed for feature selection and parameter optimization, which uses the minimum out of bag (OOB) data error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms, hybrid genetic-random forests algorithm, hybrid particle swarm-random forests algorithm and hybrid fish swarm-random forests algorithm can achieve the minimum OOB error and show the best generalization ability. Conclusion: The training set produced from the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise. Thus, better classification results are produced from this feasible and effective algorithm. Moreover, the hybrid algorithm's F-value, G-mean, AUC and OOB scores demonstrate that they surpass the performance of the original RF algorithm. Hence, this hybrid algorithm provides a new way to perform feature selection and parameter optimization.
引用
收藏
页数:18
相关论文
共 79 条
[1]   To Combat Multi-Class Imbalanced Problems by Means of Over-Sampling Techniques [J].
Abdi, Lida ;
Hashemi, Sattar .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2016, 28 (01) :238-251
[2]   Can-CSC-GBE: Developing Cost-sensitive Classifier with Gentleboost Ensemble for breast cancer classification using protein amino acids and imbalanced data [J].
Ali, Safdar ;
Majid, Abdul ;
Javed, Syed Gibran ;
Sattar, Mohsin .
COMPUTERS IN BIOLOGY AND MEDICINE, 2016, 73 :38-46
[3]  
[Anonymous], 2007, Acta Electronica Sinica, V35, P2161
[4]  
[Anonymous], 2011, J METALL, DOI DOI 10.1109/ICBBE.2011.5781379
[5]  
Batista GE., 2004, ACM SIGKDD EXPL NEWS, V6, P20, DOI DOI 10.1145/1007730.1007735
[6]  
Bernard S, 2009, P 8 INT WORKSH MULT, P171
[7]   SMOTE for high-dimensional class-imbalanced data [J].
Blagus, Rok ;
Lusa, Lara .
BMC BIOINFORMATICS, 2013, 14
[8]   Class prediction for high-dimensional class-imbalanced data [J].
Blagus, Rok ;
Lusa, Lara .
BMC BIOINFORMATICS, 2010, 11 :523
[9]   SmcHD1, containing a structural-maintenance-of-chromosomes hinge domain, has a critical role in X inactivation [J].
Blewitt, Marnie E. ;
Gendrel, Anne-Valerie ;
Pang, Zhenyi ;
Sparrow, Duncan B. ;
Whitelaw, Nadia ;
Craig, Jeffrey M. ;
Apedaile, Anwyn ;
Hilton, Douglas J. ;
Dunwoodie, Sally L. ;
Brockdorff, Neil ;
Kay, Graham F. ;
Whitelaw, Emma .
NATURE GENETICS, 2008, 40 (05) :663-669
[10]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32