Deep learning with word embeddings improves biomedical named entity recognition

Cited by: 350
Authors
Habibi, Maryam [1]
Weber, Leon [1]
Neves, Mariana [2]
Wiegandt, David Luis [1]
Leser, Ulf [1]
Affiliations
[1] Humboldt Univ, Dept Comp Sci, D-10099 Berlin, Germany
[2] Hasso Plattner Inst, Enterprise Platform & Integrat Concepts, D-14482 Potsdam, Germany
DOI
10.1093/bioinformatics/btx228
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification code
071010; 081704
Abstract
Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult.
Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
Pages: I37-I48
Number of pages: 12