Protein classification with imbalanced data

Cited by: 100
Authors
Zhao, Xing-Ming [1, 2, 3]
Li, Xin [4]
Chen, Luonan [3, 5, 6]
Aihara, Kazuyuki [1, 3]
Affiliations
[1] JST, ERATO, Aihara Complex Modelling Projects, Tokyo 1510064, Japan
[2] Chinese Acad Sci, Hefei Inst Intelligent Machines, Intelligent Comp Lab, Hefei 230031, Anhui, Peoples R China
[3] Univ Tokyo, Inst Ind Sci, Tokyo 1538505, Japan
[4] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
[5] Osaka Sangyo Univ, Dept Elect & Elect Engn, Osaka 5748530, Japan
[6] Shanghai Univ, Inst Syst Biol, Shanghai 200444, Peoples R China
DOI
10.1002/prot.21870
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification codes
071010; 081704;
Abstract
Protein classification is generally a multi-class problem that can be reduced to a set of binary classification problems, with one classifier designed for each class. The proteins in a class are treated as positive examples and those outside it as negative examples. This reduction creates a class imbalance problem, because the proteins in one class are usually far fewer than the proteins outside it. As a result, classifiers tend to overfit and to perform poorly, particularly on the minority class. This article presents a new technique for protein classification with imbalanced data. First, we propose a new algorithm that addresses the imbalance with a new sampling technique and a committee of classifiers. Then, classifiers trained in different feature spaces are combined to further improve classification accuracy. Numerical experiments on benchmark datasets show promising results, which confirm the effectiveness of the proposed method in terms of accuracy. The Matlab code and supplementary materials are available at http://server2.sat.iis.u-tokyo.ac.jp/~xmzhao/proteins.html.
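The reduction and the committee idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the sampling technique or the base classifiers, so random undersampling of the negative class and a nearest-centroid classifier are used here as stand-ins, and every function name (`one_vs_rest_labels`, `train_committee`, `predict`) is hypothetical.

```python
import random

def one_vs_rest_labels(labels, target):
    """Relabel a multi-class problem as binary: target class -> 1, others -> 0."""
    return [1 if y == target else 0 for y in labels]

def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_committee(X, y, n_members=5, seed=0):
    """Train n_members nearest-centroid classifiers (an assumed base learner).

    Each member sees all positives plus an equally sized random subset of
    the negatives, so every member is trained on balanced data.
    """
    rng = random.Random(seed)
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    k = min(len(pos), len(neg))
    committee = []
    for _ in range(n_members):
        neg_sample = rng.sample(neg, k)  # random undersampling (assumption)
        committee.append((centroid(pos), centroid(neg_sample)))
    return committee

def predict(committee, x):
    """Majority vote over the committee; ties go to the negative class."""
    votes = sum(
        1 for c_pos, c_neg in committee if sq_dist(x, c_pos) < sq_dist(x, c_neg)
    )
    return 1 if 2 * votes > len(committee) else 0

# Toy data: 3 proteins of class "A" against 30 proteins of other classes.
X = [[5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
labels = ["A", "A", "A"]
for i in range(30):
    X.append([(i % 3) * 0.1, (i % 5) * 0.1])
    labels.append("B" if i % 2 == 0 else "C")

y = one_vs_rest_labels(labels, "A")
committee = train_committee(X, y)
print(predict(committee, [5.0, 5.0]), predict(committee, [0.1, 0.1]))
```

Each committee member is trained on a balanced subset, so no single member is dominated by the majority class, and voting reduces the variance introduced by discarding negatives. One such committee would be trained per protein class.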
Pages: 1125-1132
Page count: 8