Protein classification with imbalanced data

Cited by: 100

Authors:

Zhao, Xing-Ming [1,2,3]

Li, Xin [4]

Chen, Luonan [3,5,6]

Aihara, Kazuyuki [1,3]

Affiliations:

[1] JST, ERATO, Aihara Complex Modelling Projects, Tokyo 1510064, Japan

[2] Chinese Acad Sci, Hefei Inst Intelligent Machines, Intelligent Comp Lab, Hefei 230031, Anhui, Peoples R China

[3] Univ Tokyo, Inst Ind Sci, Tokyo 1538505, Japan

[4] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China

[5] Osaka Sangyo Univ, Dept Elect & Elect Engn, Osaka 5748530, Japan

[6] Shanghai Univ, Inst Syst Biol, Shanghai 200444, Peoples R China

Source:

PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS | 2008 / Vol. 70 / Issue 04

DOI:

10.1002/prot.21870

Chinese Library Classification (CLC):

Q5 [Biochemistry]; Q7 [Molecular Biology]

Subject classification codes:

071010; 081704

Abstract:

Protein classification is generally a multi-class problem that can be reduced to a set of binary classification problems, with one classifier designed for each class. The proteins in a class are treated as positive examples and those outside the class as negative examples. This reduction introduces a class imbalance problem, because the number of proteins in one class is usually much smaller than the number outside it. As a result, classifiers tend to overfit and to perform poorly, particularly on the minority class. This article presents a new technique for protein classification with imbalanced data. First, we propose a new algorithm that addresses the imbalance problem with a new sampling technique and a committee of classifiers. Then, classifiers trained in different feature spaces are combined to further improve the accuracy of protein classification. Numerical experiments on benchmark datasets show promising results, which confirm the effectiveness of the proposed method in terms of accuracy. The Matlab code and supplementary materials are available at http://server2.sat.iis.u-tokyo.ac.jp/~xmzhao/proteins.html.
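The one-vs-rest reduction and the committee idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not describe the authors' sampling technique, so this sketch substitutes plain random under-sampling of the negative class; it is not the published algorithm. The `NearestCentroid` base classifier, all function names, the committee size, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def one_vs_rest_labels(y, positive_class):
    """Reduce multi-class labels to binary: 1 for the target class, 0 otherwise."""
    return (np.asarray(y) == positive_class).astype(int)


class NearestCentroid:
    """Tiny stand-in base classifier (hypothetical choice, not from the paper)."""

    def fit(self, X, y):
        self.pos_ = X[y == 1].mean(axis=0)
        self.neg_ = X[y == 0].mean(axis=0)
        return self

    def decision_function(self, X):
        # A positive score means the sample is closer to the positive centroid.
        d_neg = np.linalg.norm(X - self.neg_, axis=1)
        d_pos = np.linalg.norm(X - self.pos_, axis=1)
        return d_neg - d_pos


def train_balanced_committee(X, y_bin, n_members=5, seed=0):
    """Train each member on all positives plus an equal-sized random negative subset."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y_bin == 1)
    neg_idx = np.flatnonzero(y_bin == 0)
    n_draw = min(len(pos_idx), len(neg_idx))
    committee = []
    for _ in range(n_members):
        sampled_neg = rng.choice(neg_idx, size=n_draw, replace=False)
        idx = np.concatenate([pos_idx, sampled_neg])
        committee.append(NearestCentroid().fit(X[idx], y_bin[idx]))
    return committee


def committee_predict(committee, X):
    """Average the member scores and threshold the mean at zero."""
    scores = np.mean([m.decision_function(X) for m in committee], axis=0)
    return (scores > 0).astype(int)


# Synthetic, imbalanced toy data: 20 points in class 0 versus 200 in other classes.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=3.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=-3.0, scale=0.5, size=(100, 2)),
])
y = np.array([0] * 20 + [1] * 100 + [2] * 100)

y_bin = one_vs_rest_labels(y, positive_class=0)
committee = train_balanced_committee(X, y_bin, n_members=5, seed=0)
pred = committee_predict(committee, X)
minority_recall = (pred[y_bin == 1] == 1).mean()
```

Each committee member sees a balanced training set, so no single member is dominated by the majority class, and averaging the scores reduces the variance introduced by discarding negatives. Repeating the procedure once per class gives the full multi-class classifier.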

Citation

Pages: 1125-1132

Page count: 8
