Improving feature ranking for biomarker discovery in proteomics mass spectrometry data using genetic programming

Cited by: 20
Authors
Ahmed, Soha [1]
Zhang, Mengjie [1]
Peng, Lifeng [2]
Affiliations
[1] Victoria Univ Wellington, Sch Engn & Comp Sci, Wellington, New Zealand
[2] Victoria Univ Wellington, Sch Biol Sci, Wellington, New Zealand
Funding
National Natural Science Foundation of China
Keywords
biomarker discovery; feature selection; genetic programming; classification; FEATURE-SELECTION; EXPERT KNOWLEDGE; CANCER; CLASSIFICATION; SPECTRA
DOI
10.1080/09540091.2014.906388
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Feature selection on mass spectrometry (MS) data is essential for improving classification performance and for biomarker discovery. The number of MS samples is typically very small compared with the high dimensionality of the samples, which makes biomarker discovery very hard. In this paper, we propose the use of genetic programming (GP) for biomarker detection and classification of MS data. The proposed approach is composed of two phases: in the first phase, feature selection and ranking are performed; in the second phase, classification is performed. The results show that the proposed method can achieve better classification performance and a better biomarker detection rate than the information gain (IG)-based and the RELIEF feature selection methods. Four classifiers, Naive Bayes, J48 decision tree, random forest and support vector machines, are also used to further test the performance of the top-ranked features. The results show that the four classifiers achieve better performance with the top-ranked features from the proposed method than with those from the IG and the RELIEF methods. Furthermore, GP also outperforms a genetic algorithm approach on most of the data sets used.
Pages: 215-243
Number of pages: 29