MFSC: Multi-voting based feature selection for classification of Golgi proteins by adopting the general form of Chou's PseAAC components

Cited by: 30
Authors
Ahmad, Jamal [1]
Hayat, Maqsood [1]
Affiliation
[1] Abdul Wali Khan Univ Mardan, Dept Comp Sci, Mardan, Pakistan
Keywords
Golgi apparatus; SAAC; PSSM; k-Nearest Neighbor; PREDICT SUBCELLULAR-LOCALIZATION; AMYOTROPHIC-LATERAL-SCLEROSIS; LYSINE SUCCINYLATION SITES; AMINO-ACID-COMPOSITION; 3 DIFFERENT MODES; ENSEMBLE CLASSIFIER; RECOMBINATION SPOTS; MEMBRANE-PROTEINS; WEB SERVER; DISCRIMINANT ALGORITHM;
DOI
10.1016/j.jtbi.2018.12.017
Chinese Library Classification number
Q [Biological sciences]
Discipline classification code
07; 0710; 09
Abstract
Automatic identification of protein subcellular localization has gained much popularity over the last few decades. Knowledge of subcellular localization is useful in diagnosing diseases and in drug development. The Golgi apparatus is vital because it transports several other proteins destined for the lysosome, the plasma membrane, and secretion. Cis-Golgi and trans-Golgi are the two faces of the Golgi apparatus, responsible for receiving and dispatching various substances. Dysfunction in Golgi proteins may lead to different types of diseases, especially inheritable and neurodegenerative disorders. Given this significance, correct identification of Golgi proteins is indispensable. In this paper, a novel, high-throughput computational model is proposed that identifies sub-Golgi proteins precisely. Discrete and evolutionary feature extraction schemes are applied to capture the salient, noise-free, and relevant information in protein sequences. Unfortunately, the publicly available benchmark dataset is highly imbalanced: trans-Golgi sequences constitute 72% of the whole dataset, which introduces bias and redundancy and limits generalization. To overcome the limitations of imbalanced data, the Synthetic Minority Over-sampling Technique (SMOTE) is used to balance the number of instances across the classes of the dataset. In addition, a condensed feature space is formed by fusing the high-ranked features of eleven different feature selection techniques. The high-ranked features are selected through a majority voting algorithm; consequently, the feature space is reduced by 85%. The experimental results demonstrate that the kNN classifier obtained promising results in combination with the hybrid feature space. It yielded an accuracy of 98% in the jackknife cross-validation test, 94% on independent data, and 96% in the 10-fold cross-validation test. It is ascertained that the proposed model is reliable and consistent and serves as a valuable tool for the research community. © 2018 Elsevier Ltd. All rights reserved.
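The majority-voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the eleven feature selection techniques, the rank cutoff, or the exact voting rule, so everything below is an assumption for illustration: each selector is represented only by a list of per-feature scores, `k` is a hypothetical top-rank cutoff, and a feature is kept when more than half of the selectors place it in their top `k`.

```python
from collections import Counter


def top_k(scores, k):
    """Return the indices of the k highest-scoring features (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def majority_vote_select(score_lists, k):
    """Keep features ranked in the top k by more than half of the selectors.

    score_lists: one list of per-feature scores for each selector (hypothetical input).
    k: assumed rank cutoff; the abstract does not state the value used.
    """
    votes = Counter()
    for scores in score_lists:
        votes.update(top_k(scores, k))
    threshold = len(score_lists) / 2
    return sorted(f for f, v in votes.items() if v > threshold)


# Made-up scores from three selectors over five features (illustration only).
scores_a = [0.9, 0.1, 0.8, 0.2, 0.3]
scores_b = [0.7, 0.6, 0.1, 0.2, 0.9]
scores_c = [0.8, 0.2, 0.9, 0.1, 0.4]

selected = majority_vote_select([scores_a, scores_b, scores_c], k=2)
print(selected)  # [0, 2]
```

Here feature 0 receives three votes and feature 2 receives two, so both pass the majority threshold, while feature 4 is chosen by only one selector and is dropped. In a setting like the one the abstract describes, the score lists would come from the individual feature selection techniques applied to the extracted protein features.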
Pages: 99-109
Number of pages: 11