Selection of relevant features from amino acids enables development of robust classifiers

Cited by: 7
Authors
Das Roy, Rishi [1]
Dash, Debasis [1]
Affiliation
[1] CSIR Inst Genom & Integrat Biol, GN Ramachandran Knowledge Ctr Genome Informat, Delhi 110007, India
Keywords
Protein sequence analysis; Feature extraction; Protein classifier design and evaluation; Mitochondrial protein; Machine learning; SUPPORT VECTOR MACHINE; WEB SERVER; PREDICTION; PROTEIN; SEQUENCE; CLASSIFICATION; LOCALIZATION; PEPTIDES; TOOL;
DOI
10.1007/s00726-014-1697-z
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Machine learning (ML) has been extensively applied to develop models and to understand high-throughput data of biological processes. However, new ML models, trained with novel experimental results, must be built regularly for more precise predictions. ML methods can build models from numeric data, whereas biological data are generally textual (DNA, protein sequences) or images and need feature calculation algorithms to generate quantitative features. Programming skills along with domain knowledge are required to develop these algorithms. Therefore, the process of knowledge discovery through ML is slowed by the lack of generic tools to construct features and to build models directly from the data. Hence, we developed a schema that calculates about 5,000 features, selects relevant features and develops protein classifiers from the training data. To demonstrate the general applicability and robustness of our method, fungal adhesins and nuclear receptor proteins were used for building classifiers, which outperformed existing classifiers when tested on independent data. Next, we built a classifier for mitochondrial proteins of Plasmodium falciparum, which causes human malaria, because the latest corresponding classifiers are not publicly accessible. Our classifier attained 98.18% accuracy and a Matthews correlation coefficient of 0.95 by fivefold cross-validation and outperformed existing classifiers on an independent test set. We implemented this schema as a user-friendly, open-source application, Pro-Gyan (http://code.google.com/p/pro-gyan/), to build and share executable classifiers without programming knowledge.
Pages: 1343-1351
Number of pages: 9
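Illustrative sketch
The abstract describes turning protein sequences into numeric features and scoring classifiers with the Matthews correlation coefficient. The Python sketch below shows one simple instance of each step. It is an illustration under stated assumptions, not Pro-Gyan's code: amino acid composition is used here only as an example feature family (the record does not list which of the roughly 5,000 features the tool computes), and the function names `amino_acid_composition` and `matthews_corrcoef` are hypothetical.

```python
from math import sqrt

# The 20 standard amino acids, in a fixed order so feature vectors line up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Assumption: composition is only one example of a numeric feature;
    it is not claimed to be the feature set used by Pro-Gyan.
    """
    residues = [c for c in sequence.upper() if c in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / total for aa in AMINO_ACIDS]


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    features = amino_acid_composition("MKVLAAGIVA")  # toy sequence, not real data
    print(len(features), round(sum(features), 6))
    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a fuller pipeline, vectors like these would be passed through a feature-selection step and a classifier, with the coefficient averaged over cross-validation folds.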