Deep learning with word embeddings improves biomedical named entity recognition

Cited by: 350
Authors
Habibi, Maryam [1]
Weber, Leon [1]
Neves, Mariana [2]
Wiegandt, David Luis [1]
Leser, Ulf [1]
Affiliations
[1] Humboldt Univ, Dept Comp Sci, D-10099 Berlin, Germany
[2] Hasso Plattner Inst, Enterprise Platform & Integrat Concepts, D-14482 Potsdam, Germany
DOI
10.1093/bioinformatics/btx228
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult.
Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
Pages: I37-I48
Number of pages: 12