Stopping rules for mutual information-based feature selection

Cited by: 19
Authors
Mielniczuk, Jan [1 ,2 ]
Teisseyre, Pawel [1 ]
Affiliations
[1] Polish Acad Sci, Inst Comp Sci, Jana Kazimierza 5, PL-01248 Warsaw, Poland
[2] Warsaw Univ Technol, Fac Math & Informat Sci, Warsaw, Poland
Keywords
Entropy; Mutual information; Interaction information; Feature selection; Multiple hypothesis testing; Stopping rules; REGRESSION; FRAMEWORK;
DOI
10.1016/j.neucom.2019.05.048
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, feature selection methods based on mutual information have attracted significant attention. Most of the proposed methods rely on sequential forward search, which at each step adds the feature that is most relevant in explaining the class variable when considered together with the already chosen features. Such procedures produce a ranking of features ordered according to their relevance. However, a significant limitation of all existing methods is the lack of stopping rules that separate the relevant features placed at the top of the ranking from the irrelevant ones. Finding an appropriate stopping rule is particularly important in domains where one wants to precisely determine the set of features affecting the class variable and discard the irrelevant ones (e.g., in genome-wide association studies the goal is to precisely determine the DNA mutations affecting a disease). In this work we propose stopping rules based on the distribution of an approximation of conditional mutual information, given that all relevant features have already been selected. We show that this distribution is approximately chi-squared with an appropriate number of degrees of freedom, provided the features are discretized into a moderate number of bins. The proposed stopping rules are based on quantiles of that distribution and the related p-values, which are compared with thresholds used in multiple hypothesis testing. Importantly, the proposed methods do not require additional validation data and are independent of the classifier. Extensive simulation experiments indicate that the rules separate relevant features from irrelevant ones. We show experimentally that the Positive Selection Rate (the fraction of relevant features correctly selected, relative to all relevant features) approaches 1 as the sample size increases. At the same time, the False Discovery Rate (the fraction of irrelevant features selected, relative to all selected features) is controlled. Experiments on 17 benchmark datasets indicate that classification models built on the features selected by the proposed methods achieve significantly higher accuracy than models based on all available features in 13 cases. (C) 2019 Elsevier B.V. All rights reserved.
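A minimal sketch of the idea described in the abstract is given below. It is not the authors' exact procedure: it assumes discrete (binned) features, uses a plain plug-in estimate of conditional mutual information rather than the approximation studied in the paper, and relies on the classical result that 2N times the estimated conditional mutual information is approximately chi-squared with (|X|-1)(|Y|-1)|Z| degrees of freedom under conditional independence. The function names (cond_mutual_info, forward_select) and the fixed significance level alpha are illustrative; the paper compares p-values with multiple-testing thresholds rather than a single fixed level.

# Sketch only: greedy forward search with a chi-squared stopping rule,
# assuming discrete features; names and the fixed alpha are illustrative.
import numpy as np
from scipy.stats import chi2


def cond_mutual_info(x, y, z):
    """Plug-in estimate of I(X; Y | Z) in nats for discrete 1-D arrays."""
    mi = 0.0
    n = len(x)
    for zv in np.unique(z):
        mask = (z == zv)
        pz = mask.mean()
        xs, ys = x[mask], y[mask]
        m = mask.sum()
        # joint and marginal counts within the stratum Z = zv
        joint = {}
        for xi, yi in zip(xs, ys):
            joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
        nx = {v: (xs == v).sum() for v in np.unique(xs)}
        ny = {v: (ys == v).sum() for v in np.unique(ys)}
        for (xi, yi), c in joint.items():
            pxy = c / m
            mi += pz * pxy * np.log(pxy * m * m / (nx[xi] * ny[yi]))
    return mi


def forward_select(X, y, alpha=0.05):
    """Greedy forward search; stop when the best candidate's p-value exceeds alpha."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining:
        # conditioning variable: joint configuration of already selected features
        if selected:
            z = np.array(['_'.join(map(str, row)) for row in X[:, selected]])
        else:
            z = np.zeros(n, dtype=int)
        best_j, best_cmi = None, -np.inf
        for j in remaining:
            cmi = cond_mutual_info(X[:, j], y, z)
            if cmi > best_cmi:
                best_j, best_cmi = j, cmi
        # chi-squared null approximation: df = (|X|-1)(|Y|-1)|Z|
        df = ((len(np.unique(X[:, best_j])) - 1)
              * (len(np.unique(y)) - 1)
              * len(np.unique(z)))
        pval = chi2.sf(2 * n * best_cmi, df)
        if pval > alpha:  # stopping rule: best remaining candidate looks irrelevant
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

For example, forward_select(X, y, alpha=0.05) on an n-by-p matrix of discretized features returns the indices added before the first step at which the best candidate's p-value exceeds alpha.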
Pages: 255-274 (20 pages)