Biases in feature selection with missing data

Cited by: 16
Authors
Seijo-Pardo, Borja [1 ]
Alonso-Betanzos, Amparo [1 ]
Bennett, Kristin P. [2 ]
Bolon-Canedo, Veronica [1 ]
Josse, Julie [3 ]
Saeed, Mehreen [4 ]
Guyon, Isabelle [5 ]
Affiliations
[1] Univ A Coruna, CITIC, La Coruna 15006, Spain
[2] Rensselaer Polytech Inst, Troy, NY 12180 USA
[3] Ecole Polytech, CMAP, F-91128 Palaiseau, France
[4] Natl Univ Comp & Emerging Sci, FAST, Lahore 54000, Pakistan
[5] Univ Paris Saclay, UPSud, INRIA, F-91405 Orsay, France
Keywords
Feature selection; Missing data; De-biased t-test; Multiple imputation; Mutual information;
DOI
10.1016/j.neucom.2018.10.085
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Feature selection is of great importance in two scenarios: (1) prediction, i.e., improving (or minimally degrading) predictions of a target variable while discarding redundant or uninformative features, and (2) discovery, i.e., identifying features that are truly dependent on the target and may be genuine causes, to be confirmed by experimental verification (for example, in drug target discovery in genomics). In both cases, if variables have a large number of missing values, imputing them may lead to false positives: features that are not associated with the target become dependent as a result of imputation. In the first scenario, this may not harm prediction, but in the second it will erroneously select irrelevant features. In this paper, we study the risk/benefit trade-off of missing value imputation in the context of feature selection, using causal graphs to characterize when structural bias arises. Our aim is also to investigate situations in which imputing missing values may be beneficial in reducing false negatives, a situation that might arise when there is a dependency between feature and target, but the dependency falls below the significance level when only complete cases are considered. However, the benefit of reducing false negatives must be balanced against the increased number of false positives. In the case of a binary target variable and continuous features, the t-test is often used for univariate feature selection. In this paper, we also introduce a de-biased version of the t-test, allowing us to reap the benefits of imputation without incurring the penalty of an increased number of false positives. (C) 2019 Elsevier B.V. All rights reserved.
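The false-positive mechanism the abstract describes can be illustrated with a minimal simulation (not from the paper; all names and parameters here are illustrative). A feature is generated independently of a binary target, values go missing completely at random, and missing entries are filled with the class-conditional mean, an imputation that uses the target. Because the fill-in shrinks within-class variance while preserving any chance mean difference, the t statistic is inflated and the feature is selected far more often than the nominal 5% level, whereas the complete-case t-test holds its level:

```python
import numpy as np

rng = np.random.default_rng(0)

def t_stat(a, b):
    """Welch t statistic for the difference in means of two samples."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(va / len(a) + vb / len(b))

def one_trial(n=100, miss=0.5):
    # Feature truly independent of the binary target: same N(0,1) in both classes.
    x0 = rng.normal(size=n)  # class y = 0
    x1 = rng.normal(size=n)  # class y = 1
    # MCAR missingness: each value missing with probability `miss`.
    obs0 = x0[rng.random(n) > miss]
    obs1 = x1[rng.random(n) > miss]
    # Class-conditional mean imputation (the imputation model uses the target).
    imp0 = np.concatenate([obs0, np.full(n - len(obs0), obs0.mean())])
    imp1 = np.concatenate([obs1, np.full(n - len(obs1), obs1.mean())])
    return abs(t_stat(obs0, obs1)), abs(t_stat(imp0, imp1))

trials = [one_trial() for _ in range(2000)]
# Rejection rates at the |t| > 1.96 (~5%) threshold; both should be ~0.05
# if the test were valid, but imputation inflates the second rate.
cc_rate = float(np.mean([t_cc > 1.96 for t_cc, _ in trials]))
imp_rate = float(np.mean([t_imp > 1.96 for _, t_imp in trials]))
print(f"false-positive rate, complete cases:    {cc_rate:.3f}")
print(f"false-positive rate, after imputation:  {imp_rate:.3f}")
```

With 50% missingness the estimated standard error is roughly halved while the sample size doubles on paper, so the t statistic is inflated by a factor of about two; the de-biased t-test proposed in the paper is designed to correct exactly this kind of inflation.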
Pages: 97-112
Page count: 16