Investigation on particle swarm optimisation for feature selection on high-dimensional data: local search and selection bias

Cited by: 45
Authors
Tran, Binh [1]
Xue, Bing [1]
Zhang, Mengjie [1]
Nguyen, Su [1]
Affiliations
[1] Victoria Univ Wellington, Sch Engn & Comp Sci, POB 600, Wellington 6140, New Zealand
Keywords
Feature selection; particle swarm optimisation; high-dimensional data; classification; GENETIC ALGORITHM; CLASSIFICATION; SIMILARITY; MODELS;
DOI
10.1080/09540091.2016.1185392
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Feature selection is an essential step in classification tasks with a large number of features, such as those involving gene expression data. Recent research has shown that particle swarm optimisation (PSO) is a promising approach to feature selection. However, PSO can get stuck in local optima, especially in gene selection problems with a huge search space. Therefore, we developed a PSO algorithm (PSO-LSRG) that combines a fast local search with a gbest resetting mechanism to improve the performance of PSO for feature selection. Furthermore, since many existing PSO-based feature selection approaches for gene expression data suffer from feature selection bias, i.e. no unseen test data is used, two sets of experiments on 10 gene expression datasets were designed: with and without feature selection bias. Compared with standard PSO, PSO with gbest resetting only, and PSO with local search only, PSO-LSRG obtained a substantial dimensionality reduction and a significant improvement in classification performance in both sets of experiments. PSO-LSRG outperforms the other three algorithms when feature selection bias exists. When there is no feature selection bias, PSO-LSRG selects the smallest number of features in all cases, but its classification performance is slightly worse in a few cases, which may be caused by overfitting. This shows that feature selection bias should be avoided when designing a feature selection algorithm, to ensure that it generalises to unseen data.
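The feature selection bias described in the abstract can be illustrated with a small sketch. It is not the paper's method: the mean-difference filter and the nearest-centroid classifier are simple stand-ins chosen for illustration (assumptions, as are all function names, the seed and the data sizes), and the data are pure noise so that no feature is truly informative. The only point is the contrast between selecting features on all samples before evaluation (biased) and selecting them inside each training fold (unbiased).

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Pure-noise data: labels are independent of every feature."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # alternating class labels 0/1
    return X, y

def top_k_features(X, y, k):
    """Rank features by absolute difference of class means (illustrative filter)."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        c0 = [row[j] for row, label in zip(X, y) if label == 0]
        c1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(sum(c0) / len(c0) - sum(c1) / len(c1)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

def predict_nearest_centroid(X_train, y_train, x, features):
    """Classify x by the nearer class centroid, using only the chosen features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        dist = 0.0
        for j in features:
            centroid = sum(r[j] for r in rows) / len(rows)
            dist += (x[j] - centroid) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(X, y, k, select_inside_fold):
    """Leave-one-out accuracy.

    select_inside_fold=False: features are chosen once on ALL samples, so the
    held-out sample has already influenced the selection (biased estimate).
    select_inside_fold=True: features are re-chosen on each training fold only.
    """
    features_all = top_k_features(X, y, k)
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        if select_inside_fold:
            features = top_k_features(X_train, y_train, k)
        else:
            features = features_all
        if predict_nearest_centroid(X_train, y_train, X[i], features) == y[i]:
            correct += 1
    return correct / len(X)

rng = random.Random(0)  # arbitrary seed, for repeatability only
X, y = make_noise_data(n_samples=20, n_features=300, rng=rng)
acc_biased = loocv_accuracy(X, y, k=5, select_inside_fold=False)
acc_unbiased = loocv_accuracy(X, y, k=5, select_inside_fold=True)
print("selection on all data (biased):    ", acc_biased)
print("selection inside folds (unbiased): ", acc_unbiased)
```

Because the labels are unrelated to the features, an honest estimate should sit near chance level, whereas the biased protocol will typically report a noticeably higher accuracy; the exact figures depend on the seed and sizes chosen here.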
Pages: 270-294
Number of pages: 25