Automatically identifying gene/protein terms in MEDLINE abstracts

Cited by: 21
Authors
Yu, H
Hatzivassiloglou, V
Rzhetsky, A
Wilbur, WJ
Affiliations
[1] Columbia Univ, Dept Comp Sci, New York, NY 10027 USA
[2] Columbia Univ, Columbia Genome Ctr, Dept Med Informat, New York, NY 10032 USA
[3] Natl Lib Med, Natl Ctr Biotechnol Informat, NIH, Bethesda, MD 20894 USA
Keywords
automatic term recognition; synonym; mark up; information extraction; knowledge acquisition; natural language processing
DOI
10.1016/S1532-0464(03)00032-7
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Motivation. Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, identifying terms that correspond to biological substances (e.g., genes and proteins) is a necessary step before applying other NLP systems that extract biological information (e.g., protein-protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the mark-up process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database; our methods may therefore also be used for automatic lexicon generation. Results. GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts. Availability. A random sample of gene/protein symbols and full names and a sample set of marked-up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/. Contact. hy52@columbia.edu. Voice: 212-939-7028; fax: 212-666-0140. (C) 2003 Elsevier Science (USA). All rights reserved.
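To make the idea of paired symbols and full names concrete, the following is a minimal illustrative sketch, not the GPmarkup method itself (the abstract does not describe the algorithm). It assumes a simple heuristic: a parenthesized upper-case symbol is paired with the preceding words whose initial letters spell it, skipping a small set of function words. The function name `find_symbol_name_pairs` and the `STOPWORDS` list are hypothetical.

```python
import re

# Hypothetical list of function words that may appear inside a full name
# without contributing a letter to the symbol (an assumption of this sketch).
STOPWORDS = {"of", "the", "and", "in", "for", "to", "a"}


def find_symbol_name_pairs(text):
    """Return (symbol, full name) pairs for patterns like 'full name (SYMBOL)'."""
    pairs = []
    for match in re.finditer(r"\(([A-Z][A-Z0-9]{1,9})\)", text):
        symbol = match.group(1)
        # Words that precede the opening parenthesis.
        words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        letters = list(symbol.lower())
        name = []
        i = len(words) - 1
        # Walk backward, matching word initials against the symbol's letters.
        while letters and i >= 0:
            word = words[i].lower()
            if word[0] == letters[-1]:
                letters.pop()
                name.append(words[i])
            elif word in STOPWORDS and name:
                name.append(words[i])
            else:
                break
            i -= 1
        # Accept the pair only if every letter of the symbol was matched.
        if not letters:
            pairs.append((symbol, " ".join(reversed(name))))
    return pairs


example = "We cloned lymphocyte associated receptor of death (LARD) from T cells."
print(find_symbol_name_pairs(example))
```

A heuristic this simple would miss symbols that are not plain acronyms, so it should be read only as an illustration of what a symbol and full-name pair looks like in running text.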
Pages: 322-330
Page count: 9