An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB

Times cited: 11
Authors
Bell, Michael J. [1]
Gillespie, Colin S. [2]
Swan, Daniel [3]
Lord, Phillip [1]
Affiliations
[1] Newcastle Univ, Sch Comp Sci, Newcastle Upon Tyne NE1 7RU, Tyne & Wear, England
[2] Newcastle Univ, Sch Math & Stat, Newcastle Upon Tyne NE1 7RU, Tyne & Wear, England
[3] Newcastle Univ, Bioinformat Support Unit, ICAMB, Sch Med, Newcastle Upon Tyne NE1 7RU, Tyne & Wear, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
GENE ONTOLOGY; SUPPLEMENT TREMBL; PROTEIN FUNCTION; SEQUENCE; DATABASE;
DOI
10.1093/bioinformatics/bts372
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually; however, manual curation is costly and time-consuming, and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we aim to address in this article. Specifically, we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use the UniProt Knowledgebase (UniProtKB) as a case study to demonstrate this approach, since it allows us to compare annotation change both over time and between automated and manually curated annotations.
Results: By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality on free-text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality.
Pages: I562 - I568
Number of pages: 7
Related papers
38 in total
  • [1] ADAMIC L. A., 2002, Glottometrics, V3, P143, DOI 10.1109/S0SE.2014.50
  • [2] Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach
    Andorf, Carson
    Dobbs, Drena
    Honavar, Vasant
    [J]. BMC BIOINFORMATICS, 2007, 8 (1)
  • [3] Modeling Statistical Properties of Written Text
    Angeles Serrano, M.
    Flammini, Alessandro
    Menczer, Filippo
    [J]. PLOS ONE, 2009, 4 (04)
  • [4] [Anonymous], 1956, SCI INFORM THEORY
  • [5] [Anonymous], 1949, Human behaviour and the principle of least-effort
  • [6] The Universal Protein Resource (UniProt) in 2010
    Apweiler, Rolf
    Martin, Maria Jesus
    O'Donovan, Claire
    Magrane, Michele
    Alam-Faruque, Yasmin
    Antunes, Ricardo
    Barrell, Daniel
    Bely, Benoit
    Bingley, Mark
    Binns, David
    Bower, Lawrence
    Browne, Paul
    Chan, Wei Mun
    Dimmer, Emily
    Eberhardt, Ruth
    Fedotov, Alexander
    Foulger, Rebecca
    Garavelli, John
    Huntley, Rachael
    Jacobsen, Julius
    Kleen, Michael
    Laiho, Kati
    Leinonen, Rasko
    Legge, Duncan
    Lin, Quan
    Liu, Wudong
    Luo, Jie
    Orchard, Sandra
    Patient, Samuel
    Poggioli, Diego
    Pruess, Manuela
    Corbett, Matt
    di Martino, Giuseppe
    Donnelly, Mike
    van Rensburg, Pieter
    Bairoch, Amos
    Bougueleret, Lydie
    Xenarios, Ioannis
    Altairac, Severine
    Auchincloss, Andrea
    Argoud-Puy, Ghislaine
    Axelsen, Kristian
    Baratin, Delphine
    Blatter, Marie-Claude
    Boeckmann, Brigitte
    Bolleman, Jerven
    Bollondi, Laurent
    Boutet, Emmanuel
    Quintaje, Silvia Braconi
    Breuza, Lionel
    [J]. NUCLEIC ACIDS RESEARCH, 2010, 38 : D142 - D148
  • [7] The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998
    Bairoch, A
    Apweiler, R
    [J]. NUCLEIC ACIDS RESEARCH, 1998, 26 (01) : 38 - 42
  • [8] The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000
    Bairoch, A
    Apweiler, R
    [J]. NUCLEIC ACIDS RESEARCH, 2000, 28 (01) : 45 - 48
  • [9] Balasubrahmanyan V., 1996, Journal of Quantitative Linguistics, V3, P177, DOI 10.1080/09296179608599629
  • [10] Manual curation is not sufficient for annotation of genomic databases
    Baumgartner, William A., Jr.
    Cohen, K. Bretonnel
    Fox, Lynne M.
    Acquaah-Mensah, George
    Hunter, Lawrence
    [J]. BIOINFORMATICS, 2007, 23 (13) : I41 - I48