Nested cross-validation with ensemble feature selection and classification model for high-dimensional biological data

Cited by: 23
Authors
Zhong, Yi [1]
Chalise, Prabhakar [1]
He, Jianghua [1]
Affiliations
[1] Univ Kansas, Med Ctr, Dept Biostat & Data Sci, 3901 Rainbow Blvd, Kansas City, KS 66160 USA
Keywords
Area under ROC; Cross-validation; Elastic net; Ensemble learning; Random forest; Support vector machine
DOI
10.1080/03610918.2020.1850790
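Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract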
In recent years, the application of feature selection methods to biological datasets has greatly increased. Feature selection yields a subset of relevant, informative features, which results in a more interpretable model and improves prediction accuracy. In addition, ensemble learning can provide a more robust model by combining the results of multiple statistical learning models. We propose an algorithm that uses ensemble methods both to select the features and to build the classification model from the selected features. The proposed approach is a two-step, two-layer cross-validation method. The first step performs feature selection in the inner loop of cross-validation, whereas the second step builds the classification model in the outer loop of cross-validation. The final classification model obtained with the proposed method has higher prediction accuracy than one obtained with standard cross-validation. Applications of the proposed method are presented using simulated data and three real datasets.
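The two-layer structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes two simple stand-in feature rankers and a nearest-centroid classifier in place of the elastic net, random forest, and support vector machine named in the keywords, and it assumes features are chosen by vote count across inner folds. All function names, fold counts, and the synthetic data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def k_folds(indices, k, rng):
    """Shuffle the indices and split them into k roughly equal folds."""
    idx = list(indices)
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def class_values(X, y, rows, j):
    """Values of feature j in the given rows, split by binary class label."""
    a = [X[i][j] for i in rows if y[i] == 0]
    b = [X[i][j] for i in rows if y[i] == 1]
    return a, b

def mean_diff_score(X, y, rows, j):
    """Stand-in ranker 1 (assumption): absolute difference of class means."""
    a, b = class_values(X, y, rows, j)
    return abs(mean(a) - mean(b))

def scaled_diff_score(X, y, rows, j):
    """Stand-in ranker 2 (assumption): mean difference scaled by class spread."""
    a, b = class_values(X, y, rows, j)
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-9)

RANKERS = (mean_diff_score, scaled_diff_score)

def select_features(X, y, train_rows, n_inner, top_m, rng):
    """Inner loop: each ranker votes for its top_m features on every inner fold."""
    votes = Counter()
    n_features = len(X[0])
    for held_out in k_folds(train_rows, n_inner, rng):
        held = set(held_out)
        fit_rows = [i for i in train_rows if i not in held]
        for scorer in RANKERS:
            ranked = sorted(range(n_features),
                            key=lambda j: scorer(X, y, fit_rows, j),
                            reverse=True)
            votes.update(ranked[:top_m])
    # Keep the top_m most frequently selected features.
    return [j for j, _ in votes.most_common(top_m)]

def centroid_predict(X, y, train_rows, test_rows, feats):
    """Stand-in classifier (assumption): assign each row to the nearest class centroid."""
    cents = {}
    for c in (0, 1):
        rows = [i for i in train_rows if y[i] == c]
        cents[c] = [mean(X[i][j] for i in rows) for j in feats]
    preds = []
    for i in test_rows:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c]))
                for c in cents}
        preds.append(min(dist, key=dist.get))
    return preds

def nested_cv(X, y, n_outer=5, n_inner=3, top_m=5, seed=0):
    """Outer loop: select features on training rows only, then score on the held-out fold."""
    rng = random.Random(seed)
    accs, chosen = [], []
    all_rows = range(len(X))
    for test_rows in k_folds(all_rows, n_outer, rng):
        held = set(test_rows)
        train_rows = [i for i in all_rows if i not in held]
        feats = select_features(X, y, train_rows, n_inner, top_m, rng)
        preds = centroid_predict(X, y, train_rows, test_rows, feats)
        correct = sum(p == y[i] for p, i in zip(preds, test_rows))
        accs.append(correct / len(test_rows))
        chosen.append(feats)
    return mean(accs), chosen

def make_data(n=60, p=50, n_informative=5, shift=2.0, seed=1):
    """Synthetic data (assumption): only the first n_informative features carry signal."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(shift * y[i] if j < n_informative else 0.0, 1.0)
          for j in range(p)] for i in range(n)]
    return X, y

X, y = make_data()
accuracy, selected = nested_cv(X, y)
print(f"mean outer-fold accuracy: {accuracy:.3f}")
print("features selected per outer fold:", selected)
```

The point of the sketch is that the held-out outer fold never influences which features are selected, so the outer-fold accuracy is not inflated by selection bias.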
Pages: 110-125
Page count: 16