Stability of filter feature selection methods in data pipelines: a simulation study

Cited: 3
Authors
Bertolini, Roberto [1]
Finch, Stephen J. [1]
Affiliations
[1] Stony Brook Univ SUNY, Dept Appl Math & Stat, Stony Brook, NY 11794 USA
Keywords
Stability; Filter methods; Feature selection; Data pipelines; Binary classification; Machine learning; Monte Carlo simulation; Data science; CLASSIFICATION; RANKING; MACHINE;
DOI
10.1007/s41060-022-00373-6
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Filter methods are a class of feature selection techniques used to identify a subset of informative features during data preprocessing. While the differential efficacy of these techniques has been extensively compared in data science pipelines for predictive outcome modeling, less work has examined how their stability is affected by the properties of the underlying corpora. A set of six stability metrics (Davis, Dice, Jaccard, Kappa, Lustgarten, and Novovicova) was compared during cross-validation in a Monte Carlo simulation study on synthetic data to examine variability in the stability of three filter methods in data pipelines for binary classification, considering five underlying data properties: (1) measurement error in the independent covariates, (2) number of training observations, (3) number of features, (4) magnitude of class imbalance, and (5) missing data pattern. Feature selection stability was platykurtic and was negatively impacted by measurement error and by a smaller number of training observations in the input corpora. The Novovicova stability metric yielded the highest mean stability values, while the Davis stability metric was the most unstable. The distribution of all stability metrics was negatively skewed, and the Jaccard metric exhibited the largest variability across all five data properties. A statistical analysis of the synergistic effects among filter feature selection techniques, filter cutoffs, data corpora properties, and machine learning (ML) algorithms on overall pipeline efficacy, quantified using the area under the curve (AUC) evaluation metric, is also presented and discussed.
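The stability metrics compared in the abstract score how consistently a filter method selects the same features across cross-validation folds. As a minimal sketch of the idea, the snippet below computes the mean pairwise Jaccard similarity of selected feature subsets, one of the six metrics named above; the fold subsets shown are illustrative, not taken from the study.

```python
# Sketch: pairwise Jaccard stability of feature-selection results across
# cross-validation folds. Each element of `subsets` is the set of feature
# indices a filter method retained in one fold.
from itertools import combinations


def jaccard_stability(subsets):
    """Mean pairwise Jaccard similarity over all fold pairs.

    Returns a value in [0, 1]; 1 means every fold selected exactly
    the same feature subset.
    """
    pairs = list(combinations(subsets, 2))
    total = 0.0
    for a, b in pairs:
        union = a | b
        total += len(a & b) / len(union) if union else 1.0
    return total / len(pairs)


# Hypothetical selections from a filter method run in 3 folds
folds = [{0, 1, 2, 5}, {0, 1, 2, 7}, {0, 1, 3, 5}]
print(round(jaccard_stability(folds), 3))  # -> 0.511
```

The other set-overlap metrics in the study (Dice, Kappa, Lustgarten, etc.) follow the same pattern but use different pairwise similarity formulas, some correcting for the chance overlap expected given the subset sizes.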
Pages: 225-248 (24 pages)