The feature selection bias problem in relation to high-dimensional gene data

Cited by: 56
Authors
Krawczuk, Jerzy [1]
Lukaszuk, Tomasz [1]
Affiliations
[1] Bialystok Tech Univ, Fac Comp Sci, 45A Wiejska St, PL-15351 Bialystok, Poland
Keywords
Feature selection bias; Convex and piecewise linear classifier; Support vector machine; Gene selection; Microarray data; EXPRESSION PATTERNS; CLASSIFICATION; MICROARRAY; CANCER; PREDICTION; TUMOR
DOI
10.1016/j.artmed.2015.11.001
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Objective: Feature selection is a technique widely used in data mining. Its aim is to select the subset of features most relevant to the problem under consideration. In this paper, we consider feature selection for the classification of gene datasets. Gene data are usually composed of just a few dozen objects described by thousands of features. For this kind of data, it is easy to find a model that fits the learning data, but far harder to find one that performs equally well on new data. This overfitting issue is well known in classification and regression, but it applies to feature selection as well.

Methods and materials: We address this problem and investigate its importance in an empirical study of four feature selection methods applied to seven high-dimensional gene datasets. We chose datasets that are well studied in the literature: colon cancer, leukemia and breast cancer. All the datasets are characterized by a large number of features and the presence of exactly two decision classes. The feature selection methods used are ReliefF, minimum redundancy maximum relevance, support vector machine recursive feature elimination and relaxed linear separability.

Results: Our main result reveals the existence of positive feature selection bias in all 28 experiments (7 datasets and 4 feature selection methods). Bias was calculated as the difference between validation and test accuracies and ranges from 2.6% to as much as 41.67%. The validation accuracy (biased accuracy) was calculated on the same dataset on which the feature selection was performed. The test accuracy was calculated on data that was not used for feature selection (by so-called external cross-validation).

Conclusions: This work provides evidence that using the same dataset for feature selection and learning is not appropriate. We recommend using cross-validation for feature selection in order to reduce selection bias. (C) 2015 Elsevier B.V. All rights reserved.
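The bias described in the abstract is easy to reproduce. Below is a minimal sketch (not the authors' code; it uses scikit-learn's univariate `SelectKBest` filter and a linear SVM as stand-ins for the paper's selection methods and classifiers): on pure-noise data, selecting features on the full dataset before cross-validation inflates the estimated accuracy, while refitting the selection inside each training fold, the "external cross-validation" of the paper, keeps the estimate near chance.

```python
# Illustrative sketch of feature selection bias on pure-noise data.
# With random labels, true accuracy is 50%; any estimate well above
# that reflects bias from the evaluation protocol, not real signal.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(50, 2000)           # 50 samples, 2000 noise features
y = rng.randint(0, 2, size=50)    # random binary labels

clf = LinearSVC(dual=False)

# Biased protocol: pick the 20 "best" features using ALL the data,
# then cross-validate only the classifier on the reduced data.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(clf, X_sel, y, cv=5).mean()

# External cross-validation: the selector is refit inside each
# training fold, so test folds never influence feature selection.
pipe = make_pipeline(SelectKBest(f_classif, k=20), clf)
unbiased = cross_val_score(pipe, X, y, cv=5).mean()

print(f"biased CV accuracy:   {biased:.2f}")   # typically well above 0.5
print(f"unbiased CV accuracy: {unbiased:.2f}") # typically near 0.5
```

The gap between the two estimates is exactly the selection bias the paper measures as the difference between validation and test accuracies.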
Pages: 63-71 (9 pages)