Deep learning with word embeddings improves biomedical named entity recognition

Cited by: 350

Authors:

Habibi, Maryam [1]

Weber, Leon [1]

Neves, Mariana [2]

Wiegandt, David Luis [1]

Leser, Ulf [1]

Affiliations:

[1] Humboldt Univ, Dept Comp Sci, D-10099 Berlin, Germany

[2] Hasso Plattner Inst, Enterprise Platform & Integrat Concepts, D-14482 Potsdam, Germany

Source:

BIOINFORMATICS | 2017 / Vol. 33 / Issue 14

DOI:

10.1093/bioinformatics/btx228

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]

Discipline classification codes:

071010; 081704

Abstract:

Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult.

Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
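The abstract reports results as F1-score and attributes the gain mainly to recall. As a reminder of how these measures relate, here is a minimal sketch of entity-level precision, recall and F1 under exact-match scoring. The function name `precision_recall_f1`, the tuple layout `(doc_id, start, end, entity_type)` and the toy annotations are assumptions made for illustration; they are not taken from the paper, whose exact evaluation protocol is not described in this record.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores under exact matching (an illustrative assumption).

    Each entity is a hashable tuple: (doc_id, start, end, entity_type).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy annotations: four gold entities in one document.
gold = [
    ("doc1", 0, 5, "Gene"),
    ("doc1", 10, 18, "Chemical"),
    ("doc1", 25, 32, "Disease"),
    ("doc1", 40, 44, "Gene"),
]

# A tagger that finds two gold entities and makes one spurious prediction.
low_recall = [gold[0], gold[1], ("doc1", 50, 55, "Gene")]

# A tagger with the same single spurious prediction that finds three gold entities.
higher_recall = [gold[0], gold[1], gold[2], ("doc1", 50, 55, "Gene")]

p1, r1, f1_low = precision_recall_f1(gold, low_recall)
p2, r2, f1_high = precision_recall_f1(gold, higher_recall)
print(f"low recall:    P={p1:.3f} R={r1:.3f} F1={f1_low:.3f}")
print(f"higher recall: P={p2:.3f} R={r2:.3f} F1={f1_high:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a tagger that recovers more of the gold entities while making the same number of spurious predictions scores a higher F1, as the second toy tagger does here.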

Pages: i37-i48

Page count: 12
