Protein classification with imbalanced data

Cited by: 100
Authors
Zhao, Xing-Ming [1, 2, 3]
Li, Xin [4]
Chen, Luonan [3, 5, 6]
Aihara, Kazuyuki [1, 3]
Affiliations
[1] JST, ERATO, Aihara Complex Modelling Projects, Tokyo 1510064, Japan
[2] Chinese Acad Sci, Hefei Inst Intelligent Machines, Intelligent Comp Lab, Hefei 230031, Anhui, Peoples R China
[3] Univ Tokyo, Inst Ind Sci, Tokyo 1538505, Japan
[4] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
[5] Osaka Sangyo Univ, Dept Elect & Elect Engn, Osaka 5748530, Japan
[6] Shanghai Univ, Inst Syst Biol, Shanghai 200444, Peoples R China
DOI
10.1002/prot.21870
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Protein classification is generally a multi-class problem that can be reduced to a set of binary classification problems, with one classifier designed for each class. The proteins in a class are treated as positive examples and those outside it as negative examples. This reduction introduces class imbalance, because the number of proteins in one class is usually much smaller than the number outside it. As a result, classifiers tend to overfit and to perform poorly, particularly on the minority class. This article presents a new technique for protein classification with imbalanced data. First, we propose a new algorithm that addresses the imbalance with a new sampling technique and a committee of classifiers. Then, classifiers trained in different feature spaces are combined to further improve classification accuracy. Numerical experiments on benchmark datasets show promising results, which confirm the effectiveness of the proposed method in terms of accuracy. The Matlab code and supplementary materials are available at http://server2.sat.iis.u-tokyo.ac.jp/~xmzhao/proteins.html.
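As a rough illustration of the ideas named in the abstract, the sketch below relabels a multi-class problem as one-vs-rest, then trains a committee of classifiers, each on all positives plus a random subsample of negatives of the same size, and combines them by majority vote. It is not the authors' algorithm: the abstract does not specify their sampling technique or base classifier, so the random undersampling, the toy `NearestCentroid` learner, and every function name here are assumptions made for illustration only.

```python
import random


def one_vs_rest_labels(labels, target):
    """Relabel a multi-class problem as binary: 1 for `target`, 0 otherwise."""
    return [1 if y == target else 0 for y in labels]


def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Toy base learner (an assumption, not the paper's classifier)."""

    def fit(self, X, y):
        self.pos = centroid([x for x, t in zip(X, y) if t == 1])
        self.neg = centroid([x for x, t in zip(X, y) if t == 0])
        return self

    def predict(self, x):
        return 1 if sq_dist(x, self.pos) <= sq_dist(x, self.neg) else 0


def train_committee(X, y, n_members=5, seed=0):
    """Train each member on all positives plus an equal-sized random
    subsample of negatives (plain undersampling, assumed for illustration)."""
    rng = random.Random(seed)
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    k = min(len(pos), len(neg))
    committee = []
    for _ in range(n_members):
        neg_sample = rng.sample(neg, k)
        Xb = pos + neg_sample
        yb = [1] * len(pos) + [0] * k
        committee.append(NearestCentroid().fit(Xb, yb))
    return committee


def committee_predict(committee, x):
    """Majority vote over the committee members."""
    votes = sum(member.predict(x) for member in committee)
    return 1 if 2 * votes > len(committee) else 0


# Hypothetical toy data: class "a" is the minority (3 of 23 examples).
labels = ["a"] * 3 + ["b"] * 10 + ["c"] * 10
X = (
    [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.0]]
    + [[5.0 + 0.1 * i, 5.0] for i in range(10)]
    + [[5.0, 6.0 + 0.1 * i] for i in range(10)]
)
y = one_vs_rest_labels(labels, "a")
committee = train_committee(X, y, n_members=5, seed=0)
```

Calling `committee_predict(committee, [0.05, 0.05])` classifies a point near the minority cluster, and repeating the procedure with a different `target` gives one committee per class. Combining committees trained in different feature spaces, the second step the abstract mentions, is not covered by this sketch.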
Pages: 1125-1132
Number of pages: 8