Automated identification of protein classification and detection of annotation errors in protein databases using statistical approaches

Cited by: 0
Authors
Ning, Kang
Chua, Hon Nian
Affiliations
[1] Natl Univ Singapore, Sch Comp, Singapore 117543, Singapore
[2] Natl Univ Singapore, Singapore 117548, Singapore
Source
KNOWLEDGE DISCOVERY IN LIFE SCIENCE LITERATURE, PROCEEDINGS | 2006 / Vol. 3886
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Because of the importance of proteins in the life sciences, biologists have spent the past few decades working to elucidate their structures, functions, and expression profiles in order to understand their roles in living cells. Protein databases are now widely used by biologists, so it is critical that the information researchers work with be as accurate as possible. However, these databases are growing rapidly, and existing protein databases are already known to contain annotation errors. In this paper, we investigate why protein databases contain mis-annotated sequence data. Then, using statistical approaches, we derive a method to automatically filter the data from these databases and assess its reliability. This is important for providing accurate information to researchers and will help reduce further annotation errors that propagate from existing mis-annotated sequence data. Our initial experiments confirm our theoretical findings and show that our methods can effectively detect mis-annotated sequence data.
Pages: 123-138
Page count: 16