Automating curation using a natural language processing pipeline

Cited by: 0
Authors
Alex B. [1]
Grover C. [1]
Haddow B. [1]
Kabadjov M. [1]
Klein E. [1]
Matthews M. [1]
Tobin R. [1]
Wang X. [1]
Affiliation
[1] School of Informatics, University of Edinburgh, Edinburgh EH8 9AB
Keywords
Natural Language Processing; Named Entity Recognition; Relation Extraction; Gene Mention; Gene Lexicon
DOI
10.1186/gb-2008-9-s2-s10
Abstract
Background: The tasks in BioCreative II were designed to approximate some of the laborious work involved in curating biomedical research papers. The approach to these tasks taken by the University of Edinburgh team was to adapt and extend the existing natural language processing (NLP) system that we have developed as part of a commercial curation assistant. Although this paper concentrates on using NLP to assist with curation, the system can equally be employed to extract types of information from the literature that are immediately relevant to biologists in general.
Results: Our system was among the highest performing on the interaction subtasks, and competitive performance on the gene mention task was achieved with minimal development effort. For the gene normalization task, a string matching technique that can be quickly applied to new domains was shown to perform close to average.
Conclusion: The technologies being developed were shown to be readily adapted to the BioCreative II tasks. Although high performance may be obtained on individual tasks such as gene mention recognition and normalization, and document classification, tasks in which a number of components must be combined, such as detection and normalization of interacting protein pairs, are still challenging for NLP systems.
© 2008 Alex et al.; licensee BioMed Central Ltd.