Benchmark of filter methods for feature selection in high-dimensional gene expression survival data

Cited by: 83
Authors
Bommert, Andrea [1]
Welchowski, Thomas [2]
Schmid, Matthias [2]
Rahnenfuehrer, Joerg [1]
Affiliations
[1] TU Dortmund Univ, Dept Stat, Vogelpothsweg 87, D-44227 Dortmund, Germany
[2] Univ Bonn, Med Fac, Inst Med Biometry Informat & Epidemiol IMBIE, Bonn, Germany
Keywords
benchmark; feature selection; filter methods; high-dimensional data; survival analysis; MICROARRAY DATA; MUTUAL INFORMATION; ALGORITHMS; MODEL; CLASSIFICATION; REGULARIZATION
DOI
10.1093/bib/bbab354
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Feature selection is crucial for the analysis of high-dimensional data, but benchmark studies for data with a survival outcome are rare. We compare 14 filter methods for feature selection based on 11 high-dimensional gene expression survival data sets. The aim is to provide guidance on the choice of filter methods for other researchers and practitioners. We analyze the accuracy of predictive models that employ the features selected by the filter methods. Also, we consider the run time, the number of selected features for fitting models with high predictive accuracy as well as the feature selection stability. We conclude that the simple variance filter outperforms all other considered filter methods. This filter selects the features with the largest variance and does not take into account the survival outcome. Also, we identify the correlation-adjusted regression scores filter as a more elaborate alternative that allows fitting models with similar predictive accuracy. Additionally, we investigate the filter methods based on feature rankings, finding groups of similar filters.
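The variance filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `variance_filter`, the parameter `k`, and the toy expression values are hypothetical, chosen only to show that the ranking uses the feature values alone and never looks at the survival outcome.

```python
from statistics import variance


def variance_filter(rows, k):
    """Return the indices of the k features with the largest sample variance.

    rows: list of samples, each a list of feature values (hypothetical layout).
    The survival outcome is deliberately not an argument: the filter ignores it.
    """
    columns = list(zip(*rows))                    # one tuple per feature
    variances = [variance(col) for col in columns]
    # Sort feature indices by variance, largest first; ties keep column order.
    ranked = sorted(range(len(columns)), key=lambda j: variances[j], reverse=True)
    return ranked[:k]


# Made-up expression matrix: 5 samples x 4 features.
rows = [
    [1.0, 10.0, 0.5, 3.0],
    [1.1, 20.0, 0.3, 3.0],
    [0.9, 30.0, 0.7, 3.0],
    [1.0, 40.0, 0.5, 3.0],
    [1.0, 50.0, 0.5, 3.0],
]
selected = variance_filter(rows, k=2)
print(selected)
```

In a benchmark like the one summarized here, the selected feature indices would then be passed to a survival model; how many features to keep (`k`) is a tuning choice that this sketch leaves open.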
Pages: 13