Feature selection and validated predictive performance in the domain of Legionella pneumophila: A comparative study

Cited by: 10
Authors
Van Der Ploeg T. [1, 2]
Steyerberg E.W. [2]
Affiliations
[1] Department of Science, Medical Center Alkmaar, Inholland University, Alkmaar
[2] Department of Public Health, Erasmus MC-University Medical Center Rotterdam, Rotterdam
Keywords
Support Vector Machine; Feature Selection; Random Forest; Support Vector Machine Model; Least Absolute Shrinkage and Selection Operator
DOI
10.1186/s13104-016-1945-2
Abstract
Background: Genetic comparisons of clinical and environmental Legionella strains form an essential part of outbreak investigations. DNA microarrays often comprise many DNA markers (features). Feature selection and the development of prediction models are particularly challenging in this domain, which has many variables and comparatively few subjects or data points. We aimed to compare modeling strategies to develop prediction models for classifying infections as clinical or environmental.

Methods: We applied a bootstrap strategy for preselecting important features to a database containing 222 Legionella pneumophila strains with 448 continuous markers and a dichotomous outcome (clinical or environmental). Feature selection was done with 50 bootstrap samples, resulting in a top 10 of the most important features for each of four modeling techniques: classification and regression trees (CART), random forests (RF), support vector machines (SVM) and the least absolute shrinkage and selection operator (LASSO). Validation was done in a second bootstrap resampling loop (200 replicates) to evaluate discriminatory model performance according to the area under the ROC curve (AUC).

Results: The top 5 of selected features differed considerably between the various modeling techniques, with only one common feature ("LePn.007B8"). The mean validated AUC values of the SVM model and the CART model were 0.859 and 0.873, respectively. The LASSO and the RF models showed higher validated AUC values (0.925 and 0.975, respectively).

Conclusions: In the domain of Legionella pneumophila, which comprises many potential features for classifying infections as clinical or environmental, the RF and LASSO techniques provide good prediction models. The identification of potentially biologically relevant features is highly dependent on the technique used and should hence be interpreted with caution. © 2016 van der Ploeg and Steyerberg.
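The two resampling loops described in the Methods can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the feature importance is a simple stand-in (the distance of each single feature's AUC from 0.5) rather than CART, RF, SVM or LASSO, the "model" is a signed sum of the selected features, and validation on out-of-bag rows is an assumption about how the second loop might be run. All function names are hypothetical.

```python
import random
from collections import Counter


def auc(scores, labels):
    """Rank-based AUC: P(positive scores higher than negative); ties count 0.5."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        return None  # undefined when one class is missing
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def feature_auc(X, y, rows, j):
    """AUC of feature j alone, computed on the given row indices."""
    return auc([X[i][j] for i in rows], [y[i] for i in rows])


def rank_features(X, y, rows):
    """Stand-in importance (assumption): |AUC - 0.5| of each single feature."""
    def importance(j):
        a = feature_auc(X, y, rows, j)
        return 0.0 if a is None else abs(a - 0.5)
    return sorted(range(len(X[0])), key=importance, reverse=True)


def preselect(X, y, n_boot=50, top_k=10, seed=0):
    """Loop 1: count how often each feature reaches the top_k across bootstraps."""
    rng = random.Random(seed)
    n = len(y)
    counts = Counter()
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        counts.update(rank_features(X, y, rows)[:top_k])
    return [j for j, _ in counts.most_common(top_k)]


def validate(X, y, features, n_boot=200, seed=1):
    """Loop 2: fit on each bootstrap sample, score the out-of-bag rows, average AUC."""
    rng = random.Random(seed)
    n = len(y)
    aucs = []
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]
        in_bag = set(rows)
        oob = [i for i in range(n) if i not in in_bag]
        # "Fit": orient each feature so that higher values point to class 1.
        signs = []
        for j in features:
            a = feature_auc(X, y, rows, j)
            signs.append(1.0 if a is None or a >= 0.5 else -1.0)
        scores = [sum(s * X[i][j] for s, j in zip(signs, features)) for i in oob]
        a = auc(scores, [y[i] for i in oob])
        if a is not None:
            aucs.append(a)
    return sum(aucs) / len(aucs)


def make_data(n=60, p=20, informative=(0, 1, 2), seed=42):
    """Synthetic data (assumption): Gaussian noise, a few shifted informative features."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(p)]
        for j in informative:
            row[j] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y


X, y = make_data()
selected = preselect(X, y, n_boot=50, top_k=5)
print("preselected features:", selected)
print("mean out-of-bag AUC:", round(validate(X, y, selected, n_boot=200), 3))
```

Swapping `rank_features` for a technique-specific importance measure would be the natural way to compare how the preselected features differ between techniques, which is the comparison the Results report.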
Related papers
29 in total
  • [1] Fraser D.W., Tsai T.R., Orenstein W., Parkin W.E., Beecham H.J., Sharrar R.G., Harris J., Mallison G.F., Martin S.M., McDade J.E., Shepard C.C., Brachman P.S., Legionnaires' disease: Description of an epidemic of pneumonia, N Engl J Med, 297, pp. 1189-1197, (1977)
  • [2] Fry N.K., Alexiou-Daniel S., Bangsborg J.M., Bernander S., Castellani Pastoris M., Etienne J., Forsblom B., Gaia V., Helbig J.H., Lindsay D., Christian Luck P., Pelaz C., Uldum S.A., Harrison T.G., A multicenter evaluation of genotypic methods for the epidemiologic typing of Legionella pneumophila serogroup 1: Results of a pan-European study, Clin Microbiol Infect, 5, pp. 462-477, (1999)
  • [3] Chiarini A., Bonura C., Ferraro D., Barbaro R., Cala C., Distefano S., Casuccio N., Belfiore S., Giammanco A., Genotyping of Legionella pneumophila serogroup 1 strains isolated in Northern Sicily, Italy, New Microbiol, 31, pp. 217-228, (2008)
  • [4] Doleans A., Aurell H., Reyrolle M., Lina G., Freney J., Vandenesch F., Etienne J., Jarraud S., Clinical and Environmental Distributions of Legionella strains in France are different, J Clin Microbiol, 42, pp. 458-460, (2004)
  • [5] Den Boer J.W., Bruin J.P., Verhoef L.P.B., Van Der Zwaluw K., Jansen R., Yzerman E.P.F., Genotypic comparison of clinical Legionella isolates and patient-related environmental isolates in The Netherlands, 2002-2006, Clin Microbiol Infect, 14, pp. 459-466, (2008)
  • [6] Harrison T.G., Afshar B., Doshi N., Fry N.K., Lee J.V., Distribution of Legionella pneumophila serogroups, monoclonal antibody subgroups and DNA sequence types in recent clinical and environmental isolates from England and Wales (2000-2008), Eur J Clin Microbiol Infect Dis, 28, pp. 781-791, (2009)
  • [7] McCarthy M.I., Abecasis G.R., Cardon L.R., Goldstein D.B., Little J., Ioannidis J.P., Hirschhorn J.N., Genome-wide association studies for complex traits: Consensus, uncertainty and challenges, Nat Rev Genet, 9, pp. 356-369, (2008)
  • [8] Saeys Y., Inza I., Larranaga P., A review of feature selection techniques in bioinformatics, Bioinformatics, 23, 19, pp. 2507-2517, (2007)
  • [9] Guyon I., Elisseeff A., An introduction to variable and feature selection, J Mach Learn Res, 3, pp. 1157-1182, (2003)
  • [10] Wang H.Y., Zheng H., Azuaje F., Evaluation of computational classification methods for discriminating human heart failure etiology based on gene expression data, Computers in Cardiology, pp. 277-280, (2006)