Automatic extraction of protein point mutations using a Graph Bigram association

Cited by: 28

Authors:

Lee, Lawrence C.

Horn, Florence

Cohen, Fred E. ^{[1]}

Affiliations:

[1] Univ Calif San Francisco, Dept Cellular & Mol Pharmacol, San Francisco, CA 94143 USA

[2] Univ Calif San Francisco, Biomed Informat, San Francisco, CA 94143 USA

[3] Commissariat Energie Atom, Lab Biol Informat & Math, Grenoble, France

[4] Univ Calif San Francisco, Dept Biochem & Biophys, San Francisco, CA 94143 USA

Source:

PLOS COMPUTATIONAL BIOLOGY | 2007 / Volume 3 / Issue 02

Keywords:

DOI:

10.1371/journal.pcbi.0030016

Chinese Library Classification:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Protein point mutations are an essential component of the evolutionary and experimental analysis of protein structure and function. While many manually curated databases attempt to index point mutations, most experimentally generated point mutations and the biological impacts of the changes are described in the peer-reviewed published literature. We describe an application, Mutation GraB (Graph Bigram), that identifies, extracts, and verifies point mutations from biomedical literature. The principal problem of point mutation extraction is to link the point mutation with its associated protein and organism of origin. Our algorithm uses a graph-based bigram traversal to identify these relevant associations and exploits the Swiss-Prot protein database to verify this information. The graph bigram method differs from other models for point mutation extraction in that it incorporates frequency and positional data of all terms in an article to drive the point mutation-protein association. Our method was tested on 589 articles describing point mutations from the G protein-coupled receptor (GPCR), tyrosine kinase, and ion channel protein families. We evaluated our graph bigram metric against a word-proximity metric for term association on datasets of full-text literature in these three different protein families. Our testing shows that the graph bigram metric achieves a higher F-measure for the GPCRs (0.79 versus 0.76), protein tyrosine kinases (0.72 versus 0.69), and ion channel transporters (0.76 versus 0.74). Importantly, in situations where more than one protein can be assigned to a point mutation and disambiguation is required, the graph bigram metric achieves a precision of 0.84 compared with the word distance metric precision of 0.73. We believe the graph bigram search metric to be a significant improvement over previous search metrics for point mutation extraction and to be applicable to text-mining applications requiring the association of words.
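
The abstract does not spell out the traversal itself, so the following is only a minimal sketch of the general idea of associating a mutation with a protein name through a bigram graph, not the authors' implementation. Everything in it is an assumption made for illustration: the regular expression `MUTATION_RE`, the helper names `tokenize`, `build_bigram_graph`, `path_cost`, and `associate`, the choice of inverse bigram frequency as an edge weight, and the use of a shortest-path search. It uses term frequency only, ignores the positional data the abstract mentions, and omits the Swiss-Prot verification step.

```python
import heapq
import re
from collections import Counter, defaultdict

# Assumed pattern for single-letter point mutations such as "D113A":
# wild-type residue, sequence position, mutant residue.
AMINO = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b[{AMINO}]\d+[{AMINO}]\b")


def tokenize(text):
    """Split text into alphanumeric tokens (case is preserved)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def build_bigram_graph(tokens):
    """Build an undirected graph whose edges join adjacent tokens.

    Assumption: an edge costs 1 / (bigram count), so pairs of terms
    that co-occur often are cheaper to traverse.
    """
    counts = Counter()
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            counts[tuple(sorted((a, b)))] += 1
    graph = defaultdict(dict)
    for (a, b), n in counts.items():
        graph[a][b] = graph[b][a] = 1.0 / n
    return graph


def path_cost(graph, source, target):
    """Return the cheapest path cost between two terms (Dijkstra)."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return float("inf")


def associate(text, candidate_proteins):
    """Map each mutation found in text to its cheapest-to-reach candidate."""
    graph = build_bigram_graph(tokenize(text))
    return {
        m.group(0): min(
            candidate_proteins,
            key=lambda p: path_cost(graph, m.group(0), p),
        )
        for m in MUTATION_RE.finditer(text)
    }


if __name__ == "__main__":
    # Toy sentence written for this example, not taken from the paper.
    toy = ("The D113A mutation of ADRB2 abolished binding. "
           "ADRB2 D113A was compared with wild type DRD2.")
    print(associate(toy, ["ADRB2", "DRD2"]))
```

A word-proximity baseline would instead pick the candidate with the smallest token distance to a single occurrence of the mutation. The graph view pools evidence from every occurrence of each term in the article, which is the property the abstract highlights for cases where several proteins compete for one mutation.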

Citation

Pages: 184-198

Page count: 15

27 references in total

[1] BAKER CJO, 2004, P 3 CAN WORK C COMP, V1, P47
[2] BAKER CJO, 2006, INFO SYST FRONTIERS, P47
[3] Blaschke C, 1999, Proc Int Conf Intell Syst Mol Biol, P60
[4] Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 [J]. NUCLEIC ACIDS RESEARCH, 2003, 31 (01): 365-370
[5] Chang JT, Schütze H, Altman RB. GAPSCORE: finding gene and protein names one word at a time [J]. BIOINFORMATICS, 2004, 20 (02): 216-225
[6] Cotton RGH, Horaitis O. The HUGO Mutation Database Initiative [J]. The Pharmacogenomics Journal, 2002, 2 (1): 16-19
[7] Crim J, McDonald R, Pereira F. Automatically annotating documents with normalized gene lists [J]. BMC BIOINFORMATICS, 2005, 6 (Suppl 1)
[8] Edvardsen O, Reiersen AL, Beukers MW, Kristiansen K. tGRAP, the G-protein coupled receptors mutant database [J]. NUCLEIC ACIDS RESEARCH, 2002, 30 (01): 361-363
[9] FRIEDMAN C, 2001, BIOINFORMATICS S1, V17, P74
[10] Fundel K, Güttler D, Zimmer R, Apostolakis J. A simple approach for protein name identification: prospects and limits [J]. BMC BIOINFORMATICS, 2005, 6 (Suppl 1)