MER: a shell script and annotation server for minimal named entity recognition and linking

Cited by: 0
Authors
Francisco M. Couto
Andre Lamurias
Affiliations
[1] Universidade de Lisboa, LASIGE, Faculdade de Ciências
[2] University of Lisboa, Faculty of Sciences, BioISI
Source
Journal of Cheminformatics, Volume 10
Keywords
Named-entity recognition; Entity linking; Annotation server; Text mining; Biomedical ontologies; Lexicon
DOI
Not available
Abstract
Named-entity recognition aims to identify the fragments of text that mention entities of interest, which can afterwards be linked to a knowledge base where those entities are described. This manuscript presents our minimal named-entity recognition and linking tool (MER), designed with flexibility, autonomy and efficiency in mind. To annotate a given text, MER only requires: (1) a lexicon (text file) with the list of terms representing the entities of interest; (2) optionally, a tab-separated values file with a link for each term; and (3) a Unix shell. Alternatively, the user can provide an ontology from which MER will automatically generate the lexicon and links files. The efficiency of MER derives from exploiting the high performance and reliability of the text-processing command-line tools grep and awk, together with a novel inverted recognition technique. MER was deployed in a cloud infrastructure using multiple virtual machines to work as an annotation server and to participate in the Technical Interoperability and Performance of annotation Servers task of BioCreative V.5. The results show that our solution processed each document (text retrieval and annotation) in less than 3 s on average without using any type of cache. MER was also compared to a state-of-the-art dictionary lookup solution, obtaining competitive results not only in computational performance but also in precision and recall. MER is publicly available in a GitHub repository (https://github.com/lasigeBioTM/MER) and through a RESTful Web service (http://labs.fc.ul.pt/mer/).
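As a rough illustration of the inputs listed in the abstract (a lexicon and an optional mapping from terms to links), the following Python sketch annotates a text by plain dictionary lookup. It is not MER's implementation: MER is a shell script built on grep and awk with an inverted recognition technique, none of which is reproduced here. The function name `annotate`, the tuple layout of the output, and the example.org link are assumptions made for illustration only.

```python
import re


def annotate(text, lexicon, links=None):
    """Find every lexicon term in text (hypothetical helper, not MER's API).

    Returns (start, end, matched_text, link) tuples sorted by position;
    link is None when the term has no entry in the links mapping.
    """
    links = links or {}
    annotations = []
    for term in lexicon:
        if not term:
            continue  # skip empty lexicon lines
        # Case-insensitive match that does not start or end inside a word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append(
                (match.start(), match.end(), match.group(0), links.get(term.lower()))
            )
    return sorted(annotations, key=lambda a: (a[0], a[1]))


# Assumed inputs: a two-term lexicon and a links mapping keyed by lowercase term.
lexicon = ["nicotinic acid", "maltose"]
links = {"nicotinic acid": "http://example.org/nicotinic_acid"}  # placeholder URL

for annotation in annotate("Maltose and nicotinic acid were found.", lexicon, links):
    print(*annotation, sep="\t")
```

The loop scans the text once per lexicon term, which is simple to follow but scales poorly with large lexicons; it is meant only to make the lexicon and links inputs concrete.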