Estimating the annotation error rate of curated GO database sequence annotations

Cited by: 104
Authors
Jones, Craig E. [1]
Brown, Alfred L.
Baumann, Ute
Affiliations
[1] Univ Adelaide, Sch Comp Sci, Adelaide, SA 5001, Australia
[2] Univ Adelaide, Australian Ctr Plant Funct Genom, Glen Osmond, SA 5064, Australia
DOI
10.1186/1471-2105-8-170
CLC classification number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Background: Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences. However, there has been little investigation into the data quality of sequence function annotations. Here we have developed a new method of estimating the error rate of curated sequence annotations, and applied this to the Gene Ontology (GO) sequence database (GOSeqLite). This method involved artificially adding errors to sequence annotations at known rates, and used regression to model the impact on the precision of annotations based on BLAST matched sequences.

Results: We estimated the error rate of curated GO sequence annotations in the GOSeqLite database (March 2006) at between 28% and 30%. Annotations made without use of sequence similarity based methods (non-ISS) had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS) had an estimated error rate of 49%.

Conclusion: While the overall error rate is reasonably low, it would be prudent to treat all ISS annotations with caution. Electronic annotators that use ISS annotations as the basis of predictions are likely to have higher false prediction rates, and for this reason designers of these systems should consider avoiding ISS annotations where possible. Electronic annotators that use ISS annotations to make predictions should be viewed sceptically. We recommend that curators thoroughly review ISS annotations before accepting them as valid. Overall, users of curated sequence annotations from the GO database should feel assured that they are using a comparatively high quality source of information.
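
The idea of adding errors at known rates and regressing precision on them can be illustrated with a small sketch. This is not the authors' procedure, only a toy model under two loudly stated assumptions: (1) corrupting a random fraction x of annotations leaves a fraction (1 - e0)(1 - x) correct, where e0 is the unknown base error rate, and (2) measured precision equals a hypothetical ceiling `max_precision` times that correct fraction. The function names and the sample numbers are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_base_error_rate(added_rates, precisions, max_precision=1.0):
    """Estimate the base error rate e0 from precision measured at known added error rates.

    Assumed toy model (not taken from the paper):
        precision(x) = max_precision * (1 - e0) * (1 - x)
    so the fitted intercept equals max_precision * (1 - e0).
    """
    _slope, intercept = fit_line(added_rates, precisions)
    return 1.0 - intercept / max_precision


# Hypothetical measurements: precision observed after corrupting 0%, 10%, 20%, 30%
# of the annotations. These values are made up for the example.
added = [0.0, 0.1, 0.2, 0.3]
observed_precision = [0.70, 0.63, 0.56, 0.49]
base_error = estimate_base_error_rate(added, observed_precision)
```

Under this model the estimate depends directly on the assumed precision ceiling, so any real analysis would need to justify that value separately.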
Pages: 9
Related papers
18 in total
[1]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[2]   Mining sequence annotation databanks for association patterns [J].
Artamonova, II ;
Frishman, G ;
Gelfand, MS ;
Frishman, D .
BIOINFORMATICS, 2005, 21 :49-57
[3]   The universal protein resource (UniProt) [J].
Bairoch, A ;
Apweiler, R ;
Wu, CH ;
Barker, WC ;
Boeckmann, B ;
Ferro, S ;
Gasteiger, E ;
Huang, HZ ;
Lopez, R ;
Magrane, M ;
Martin, MJ ;
Natale, DA ;
O'Donovan, C ;
Redaschi, N ;
Yeh, LSL .
NUCLEIC ACIDS RESEARCH, 2005, 33 :D154-D159
[4]   Errors in genome annotation [J].
Brenner, SE .
TRENDS IN GENETICS, 1999, 15 (04) :132-133
[5]   The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology [J].
Camon, E ;
Magrane, M ;
Barrell, D ;
Lee, V ;
Dimmer, E ;
Maslen, J ;
Binns, D ;
Harte, N ;
Lopez, R ;
Apweiler, R .
NUCLEIC ACIDS RESEARCH, 2004, 32 :D262-D266
[6]  
Camon EB, 2005, BMC BIOINFORMATICS, V6, DOI 10.1186/1471-2105-6-S1-S17
[7]   Intrinsic errors in genome annotation [J].
Devos, D ;
Valencia, A .
TRENDS IN GENETICS, 2001, 17 (08) :429-431
[8]  
Galperin M Y, 1998, In Silico Biol, V1, P55
[9]   Modeling the percolation of annotation errors in a database of protein sequences [J].
Gilks, WR ;
Audit, B ;
De Angelis, D ;
Tsoka, S ;
Ouzounis, CA .
BIOINFORMATICS, 2002, 18 (12) :1641-1649
[10]   Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers [J].
Green, ML ;
Karp, PD .
NUCLEIC ACIDS RESEARCH, 2005, 33 (13) :4035-4039