Deep learning with word embeddings improves biomedical named entity recognition

Cited by: 339
Authors
Habibi, Maryam [1]
Weber, Leon [1]
Neves, Mariana [2]
Wiegandt, David Luis [1]
Leser, Ulf [1]
Affiliations
[1] Humboldt Univ, Dept Comp Sci, D-10099 Berlin, Germany
[2] Hasso Plattner Inst, Enterprise Platform & Integrat Concepts, D-14482 Potsdam, Germany
DOI
10.1093/bioinformatics/btx228
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is named entity recognition (NER), the recognition of biomedical named entities such as genes, chemicals and diseases. Current NER methods rely on pre-defined features that try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, because dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult.
Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, the F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
Pages: I37 - I48
Page count: 12