A novel approach to extracting features from motif content and protein composition for protein sequence classification

Cited by: 44
Authors
Zhao, XM
Cheung, YM
Huang, DS
Affiliations
[1] Chinese Acad Sci, Inst Intelligent Machines, Intelligent Comp Lab, Hefei 230031, Anhui, Peoples R China
[2] Univ Sci & Technol China, Dept Automat, Hefei 230026, Anhui, Peoples R China
[3] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
genetic algorithm; motif content; protein composition; protein sequence classification; support vector machine;
DOI
10.1016/j.neunet.2005.07.002
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents a novel approach to extracting features from motif content and protein composition for protein sequence classification. First, we represent each protein sequence as a fixed-dimensional vector built from its motif content and protein composition. We then project these vectors into a low-dimensional space using Principal Component Analysis (PCA), so that each vector can be expressed as a combination of the eigenvectors of the covariance matrix of the vectors. Next, a Genetic Algorithm (GA) simultaneously selects a subset of biological and functional sequence features from the eigenspace and optimizes the regularization parameter of the Support Vector Machine (SVM). Finally, SVM classifiers assign protein sequences to their corresponding families based on the selected feature subsets. In experiments, the approach shows promising results compared with the existing PSI-BLAST and SVM-pairwise methods. (c) 2005 Elsevier Ltd. All rights reserved.
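
The first step of the abstract, turning a sequence into a fixed-dimensional vector of motif content plus protein composition, could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which motif set or encoding the authors used, so the `MOTIFS` patterns, the binary presence encoding, and the function names `composition`, `motif_content`, and `featurize` are all hypothetical. The PCA, GA, and SVM stages are not covered.

```python
import re
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector is aligned.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical motif patterns written as regular expressions.
# Placeholders for illustration only, not the motif set used in the paper.
MOTIFS = [r"C..C", r"G.G..G", r"[ST].[RK]"]


def composition(seq):
    """Fraction of each standard amino acid in seq (20 values).

    Non-standard residues count toward the length but get no slot,
    so the fractions can sum to less than 1.
    """
    counts = Counter(seq)
    n = max(len(seq), 1)  # avoid division by zero on an empty sequence
    return [counts[aa] / n for aa in AMINO_ACIDS]


def motif_content(seq, motifs=MOTIFS):
    """Binary presence (1.0 or 0.0) of each motif pattern in seq.

    Assumption: presence/absence encoding; the source does not specify one.
    """
    return [1.0 if re.search(pattern, seq) else 0.0 for pattern in motifs]


def featurize(seq, motifs=MOTIFS):
    """Concatenate motif content and composition into one fixed-length vector."""
    seq = seq.upper()
    return motif_content(seq, motifs) + composition(seq)


vector = featurize("caac")
print(len(vector))  # len(MOTIFS) + 20 entries, independent of sequence length
```

Because the vector length depends only on the number of motifs and the 20 amino acids, sequences of different lengths map to vectors of the same dimension, which is what the later projection and classification steps require.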
Pages: 1019-1028
Page count: 10