Automated identification of protein classification and detection of annotation errors in protein databases using statistical approaches

Cited by: 0

Authors:

Ning, Kang

Chua, Hon Nian

Affiliations:

[1] Natl Univ Singapore, Sch Comp, Singapore 117543, Singapore

[2] Natl Univ Singapore, Singapore 117548, Singapore

Source:

KNOWLEDGE DISCOVERY IN LIFE SCIENCE LITERATURE, PROCEEDINGS | 2006 / Vol. 3886

Keywords:

DOI:

Not available

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Because of the importance of proteins in the life sciences, biologists have put great effort over the past few decades into elucidating their structures, functions and expression profiles, to help us understand their roles in living cells. Protein databases are now widely used by biologists, so it is critical that the information researchers work with be as accurate as possible. However, these databases are growing rapidly, and existing protein databases are already known to contain annotation errors. In this paper, we investigate why protein databases contain mis-annotated sequence data. Then, using statistical approaches, we derive a method to automatically filter the data from these databases and assess its reliability. This is important for providing accurate information to researchers, and it will help reduce further annotation errors that propagate from existing mis-annotated sequence data. Our initial experiments confirmed our theoretical findings and show that our methods can effectively detect mis-annotated sequence data.


Pages: 123-138

Page count: 16

