Ensemble learning for protein multiplex subcellular localization prediction based on weighted KNN with different features

Cited by: 20
Authors
Qiao, Shanping [1 ,2 ]
Yan, Baoqiang [3 ]
Li, Jing [4 ]
Affiliations
[1] Shandong Normal Univ, Sch Management Sci & Engn, Jinan 250014, Shandong, Peoples R China
[2] Univ Jinan, Sch Informat Sci & Engn, Shandong Prov Key Lab Network Based Intelligent C, Jinan 250022, Shandong, Peoples R China
[3] Shandong Normal Univ, Sch Math Sci, Jinan 250014, Shandong, Peoples R China
[4] Case Western Reserve Univ, Dept Elect Engn & Comp Sci, 10900 Euclid Ave, Cleveland, OH 44106 USA
Funding
National Natural Science Foundation of China; US National Institutes of Health; US National Science Foundation;
Keywords
Ensemble learning; Feature fusion; Protein subcellular location prediction; Weighted k-nearest neighbors; AMINO-ACID-COMPOSITION; RECENT PROGRESS; GENERAL-FORM; LOCATIONS; CLASSIFIER; SINGLEPLEX; SELECTION;
DOI
10.1007/s10489-017-1029-6
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As an important attribute of proteins, protein subcellular location(s) can provide valuable information about their functions. Determining protein subcellular locations using experimental methods is usually expensive and time-consuming. Over the years, a variety of computational approaches have been developed to predict protein subcellular locations based on knowledge of known protein locations. However, the problem is inherently hard, especially for proteins that can exist at multiple subcellular locations, and further study in this area is still greatly needed. In this paper, we propose an ensemble learning framework that uses a modified Weighted K-Nearest Neighbors (WKNN) algorithm as the base learner. Two types of features are extracted from the training data: one based on protein amino acid composition (Amphiphilic Pseudo Amino Acid Composition, or AmPseAAC) and one based on protein sequence similarity (Protein Similarity Measure, or PSM). An individual classifier is trained on each feature type, and each classifier assigns a query protein a probability distribution over the possible locations. Based on the outputs of the two base classifiers, a novel ensemble strategy named Maximized Probability on Label (MPoL) is proposed. The strategy produces a final set of locations for each protein by integrating the predictions of the base classifiers through an optimization procedure. To measure the prediction quality of the proposed approach, two types of evaluation metrics are used: example-based metrics and label-based metrics. To evaluate the performance of our approach objectively, we compare its results with those of another popular method, iLoc-Animal, on a benchmark dataset through cross-validation. The results show that, in terms of the absolute true success rate on multi-location prediction, MPoL achieves much better results than iLoc-Animal. This suggests that the proposed method has some potential for solving a diverse set of multi-label learning problems.
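The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the inverse-distance weighting stands in for the paper's modified WKNN, and the averaging rule with a fixed threshold is a simple placeholder for MPoL, whose optimization procedure is not specified here. All function names, parameters, and the threshold value are hypothetical. Only the absolute true rate follows its standard definition, the fraction of proteins whose predicted location set matches the true set exactly.

```python
import numpy as np

def wknn_label_scores(X_train, Y_train, x_query, k=3, eps=1e-9):
    """Distance-weighted KNN scores over labels (illustrative, not the paper's WKNN).

    X_train: (n, d) feature matrix; Y_train: (n, L) binary label matrix.
    Returns a length-L vector normalized to sum to 1 when any neighbor has a label.
    """
    X_train = np.asarray(X_train, dtype=float)
    Y_train = np.asarray(Y_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    # Assumption: closer neighbors count more, via inverse-distance weights.
    weights = 1.0 / (dists[nearest] + eps)
    scores = weights @ Y_train[nearest]
    total = scores.sum()
    return scores / total if total > 0 else scores

def combine_by_average(p_a, p_b, threshold=0.25):
    """Placeholder ensemble rule (NOT MPoL): average two score vectors, then threshold."""
    avg = (np.asarray(p_a, dtype=float) + np.asarray(p_b, dtype=float)) / 2.0
    return (avg >= threshold).astype(int)

def absolute_true_rate(Y_true, Y_pred):
    """Fraction of samples whose predicted label set equals the true set exactly."""
    Y_true = np.asarray(Y_true)
    Y_pred = np.asarray(Y_pred)
    return float(np.mean(np.all(Y_true == Y_pred, axis=1)))
```

In this sketch, one call to wknn_label_scores per feature type (for example AmPseAAC vectors and PSM vectors) would yield the two score vectors passed to combine_by_average. The absolute true rate is strict: a protein with two true locations counts as correct only if both, and nothing else, are predicted.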
Pages: 1813-1824
Page count: 12