Deep learning with word embeddings improves biomedical named entity recognition

Cited by: 339
Authors
Habibi, Maryam [1]
Weber, Leon [1]
Neves, Mariana [2]
Wiegandt, David Luis [1]
Leser, Ulf [1]
Affiliations
[1] Humboldt Univ, Dept Comp Sci, D-10099 Berlin, Germany
[2] Hasso Plattner Inst, Enterprise Platform & Integrat Concepts, D-14482 Potsdam, Germany
DOI
10.1093/bioinformatics/btx228
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult.

Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
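The abstract reports its results as F1-score and attributes the gain mostly to recall. As a reading aid, the following minimal sketch shows how precision, recall and F1 are typically computed for NER at the entity level. It assumes exact-match scoring on (start, end, type) spans, which is an assumption made here and not something this record states; the function names and the toy spans are hypothetical and are not taken from the paper.

```python
def entity_counts(gold: set, predicted: set) -> tuple:
    """Count true positives, false positives and false negatives.

    Assumption: an entity counts as correct only if its span and type
    match a gold annotation exactly.
    """
    tp = len(gold & predicted)   # predicted and annotated
    fp = len(predicted - gold)   # predicted but not annotated
    fn = len(gold - predicted)   # annotated but missed
    return tp, fp, fn


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy annotations: (start offset, end offset, entity type).
gold = {(0, 4, "Gene"), (10, 17, "Chemical"), (20, 28, "Disease"), (30, 35, "Gene")}
predicted = {(0, 4, "Gene"), (10, 17, "Chemical"), (40, 45, "Gene")}

tp, fp, fn = entity_counts(gold, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a tagger that finds more of the annotated entities (fewer false negatives) raises F1 even when its precision stays roughly the same.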
Pages: I37-I48
Page count: 12