Gene annotation from scientific literature using mappings between keyword systems

Cited by: 32
Authors
Pérez, AJ
Perez-Iratxeta, C
Bork, P
Thode, G
Andrade, MA
Affiliations
[1] Univ Malaga, Fac Ciencias, Dept Genet, Grp Bioinformat, E-29071 Malaga, Spain
[2] European Mol Biol Lab, D-69117 Heidelberg, Germany
[3] Max Delbruck Ctr Mol Med, Dept Bioinformat, D-13092 Berlin, Germany
DOI
10.1093/bioinformatics/bth207
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: The description of genes in databases by keywords helps the non-specialist to quickly grasp the properties of a gene and increases the efficiency of computational tools that are applied to gene data (e.g. searching a gene database for sequences related to a particular biological process). However, the association of keywords with genes or protein sequences is a difficult process that ultimately requires examination of the literature related to a gene.
Results: To support this task, we present a procedure to derive keywords from the set of scientific abstracts related to a gene. Our system is based on the automated extraction of mappings between related terms from different databases using a model of fuzzy associations that can be applied with all generality to any pair of linked databases. We tested the system by annotating genes of the SWISS-PROT database with keywords derived from the abstracts linked to their entries (stored in the MEDLINE database of scientific references). The performance of the annotation procedure was much better for SWISS-PROT keywords (recall of 47%, precision of 68%) than for Gene Ontology terms (recall of 8%, precision of 67%).
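The abstract does not give the formula behind its fuzzy associations, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a simple co-occurrence ratio as the association strength between a term in one keyword system and a term in another, learned from pairs of linked records, and then scores a prediction with precision and recall. The function names, the threshold, and the toy terms are all hypothetical.

```python
from collections import Counter, defaultdict


def learn_associations(linked_pairs):
    """Estimate association strengths between two keyword systems.

    linked_pairs: iterable of (source_terms, target_terms), one pair per
    linked record (e.g. terms of an abstract and keywords of its entry).
    ASSUMPTION: strength(s -> t) = n(s and t co-occur) / n(s occurs).
    The paper's actual fuzzy-association model may differ.
    """
    source_count = Counter()
    pair_count = Counter()
    for source_terms, target_terms in linked_pairs:
        for s in set(source_terms):
            source_count[s] += 1
            for t in set(target_terms):
                pair_count[(s, t)] += 1
    assoc = defaultdict(dict)
    for (s, t), n in pair_count.items():
        assoc[s][t] = n / source_count[s]
    return assoc


def annotate(source_terms, assoc, threshold=0.6):
    """Propose target keywords whose best association reaches the threshold."""
    scores = defaultdict(float)
    for s in source_terms:
        for t, weight in assoc.get(s, {}).items():
            scores[t] = max(scores[t], weight)
    return {t for t, weight in scores.items() if weight >= threshold}


def precision_recall(predicted, expected):
    """Precision and recall of a predicted keyword set against a reference set."""
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Toy, made-up training data: (literature terms, database keywords).
training = [
    ({"Kinases", "Phosphorylation"}, {"Kinase", "Transferase"}),
    ({"Kinases", "ATP"}, {"Kinase", "ATP-binding"}),
    ({"DNA Repair"}, {"DNA repair"}),
]
assoc = learn_associations(training)
predicted = annotate({"Kinases"}, assoc, threshold=0.6)
precision, recall = precision_recall(predicted, {"Kinase", "Transferase"})
```

In this toy run only "Kinase" clears the threshold, so the prediction is precise but misses "Transferase". That is the same kind of trade-off the reported recall and precision figures summarize.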
Pages: 2084-2091
Number of pages: 8