A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data

Cited by: 88
Authors
Gromski, Piotr S. [1]
Xu, Yun [1]
Correa, Elon [1]
Ellis, David I. [1]
Turner, Michael L. [2]
Goodacre, Royston [1]
Affiliations
[1] Univ Manchester, Manchester Inst Biotechnol, Sch Chem, Manchester M1 7DN, Lancs, England
[2] Univ Manchester, Sch Chem, Manchester M13 9PL, Lancs, England
Keywords
Variable selection; Supervised learning; Bootstrapping; Double cross-validation; Pyrolysis mass spectrometry; Bacillus; PYROLYSIS-GAS CHROMATOGRAPHY; SUPPORT VECTOR MACHINES; PARTIAL LEAST-SQUARES; VARIABLE SELECTION; CANCER CLASSIFICATION; RAPID IDENTIFICATION; NEURAL-NETWORKS; BACILLUS SPORES; GENE SELECTION; DISCRIMINATION
DOI
10.1016/j.aca.2014.03.039
Chinese Library Classification (CLC) number
O65 [Analytical Chemistry];
Discipline classification code
070302 ; 081704 ;
Abstract
Many analytical approaches such as mass spectrometry generate large amounts of data (input variables) per sample analysed, and not all of these variables are important or related to the target output of interest. The selection of a smaller number of variables prior to sample classification is a widespread task in many research studies, where attempts are made to seek the smallest possible set of variables that is still able to achieve a high level of prediction accuracy; in other words, there is a need to generate the most parsimonious solution when the number of input variables is huge but the number of samples/objects is smaller. Here, we compare several different variable selection approaches in order to ascertain which of these are best suited to achieve this goal. All variable selection approaches were applied to the analysis of a common set of metabolomics data generated by Curie-point pyrolysis mass spectrometry (Py-MS), where the goal of the study was to classify the Gram-positive bacteria Bacillus. These approaches include stepwise forward variable selection, used for linear discriminant analysis (LDA); variable importance for projection (VIP) coefficient, employed in partial least squares-discriminant analysis (PLS-DA); support vector machines-recursive feature elimination (SVM-RFE); as well as the mean decrease in accuracy and mean decrease in Gini, provided by random forests (RF). Finally, a double cross-validation procedure was applied to minimize the consequence of overfitting. The results revealed that RF with its variable selection techniques, and SVM combined with SVM-RFE as a variable selection method, displayed the best results in comparison to other approaches. (C) 2014 Elsevier B.V. All rights reserved.
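The double cross-validation idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes synthetic two-class data, a simple class-mean-difference filter standing in for the variable selection methods named above, and a nearest-centroid classifier standing in for LDA, PLS-DA, SVM, or RF. All function names (`make_data`, `select_variables`, `double_cv`, and so on) are hypothetical. The point it shows is that the number of variables is chosen only on the inner folds, so the outer test fold never influences selection.

```python
import random

def make_data(n_per_class=30, n_vars=20, n_informative=3, seed=0):
    """Synthetic data (assumption): only the first n_informative variables differ by class."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(2.0 * label if j < n_informative else 0.0, 1.0)
                      for j in range(n_vars)])
            y.append(label)
    return X, y

def class_means(X, y, idx, cols):
    """Per-class mean of the chosen columns, using only the rows listed in idx."""
    means = {}
    for label in (0, 1):
        rows = [X[i] for i in idx if y[i] == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    return means

def select_variables(X, y, idx, size):
    """Stand-in filter: keep the `size` variables with the largest class-mean gap."""
    all_cols = list(range(len(X[0])))
    m = class_means(X, y, idx, all_cols)
    gap = [abs(m[1][j] - m[0][j]) for j in all_cols]
    return sorted(all_cols, key=lambda j: -gap[j])[:size]

def accuracy(X, y, fit_idx, test_idx, cols):
    """Nearest-centroid classifier fitted on fit_idx, scored on test_idx."""
    m = class_means(X, y, fit_idx, cols)
    hits = 0
    for i in test_idx:
        dist = {c: sum((X[i][j] - mu) ** 2 for j, mu in zip(cols, m[c])) for c in m}
        hits += min(dist, key=dist.get) == y[i]
    return hits / len(test_idx)

def folds(idx, k, rng):
    """Shuffle the indices and deal them into k roughly equal folds."""
    shuffled = idx[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def double_cv(X, y, sizes=(1, 3, 10), k_outer=5, k_inner=4, seed=1):
    rng = random.Random(seed)
    everything = list(range(len(X)))
    outer_acc, chosen = [], []
    for test_idx in folds(everything, k_outer, rng):
        held_out = set(test_idx)
        train_idx = [i for i in everything if i not in held_out]
        # Inner loop: pick the number of variables using training samples only.
        inner = folds(train_idx, k_inner, rng)
        best_size, best_acc = None, -1.0
        for size in sizes:
            accs = []
            for val_idx in inner:
                val = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val]
                cols = select_variables(X, y, fit_idx, size)
                accs.append(accuracy(X, y, fit_idx, val_idx, cols))
            avg = sum(accs) / len(accs)
            if avg > best_acc:  # strict '>' keeps the smaller set on ties
                best_size, best_acc = size, avg
        # Outer loop: reselect on the full training part, score on the untouched fold.
        cols = select_variables(X, y, train_idx, best_size)
        outer_acc.append(accuracy(X, y, train_idx, test_idx, cols))
        chosen.append(best_size)
    return sum(outer_acc) / len(outer_acc), chosen

X, y = make_data()
mean_acc, chosen_sizes = double_cv(X, y)
print(f"outer-loop accuracy: {mean_acc:.2f}, sizes chosen per fold: {chosen_sizes}")
```

Because candidate sizes are tried in ascending order and ties keep the earlier value, the sketch leans toward the more parsimonious variable set, in the spirit of the goal stated in the abstract.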
Pages: 1-8
Number of pages: 8