Deep learning with word embeddings improves biomedical named entity recognition

Cited by: 350

Authors:

Habibi, Maryam [1]

Weber, Leon [1]

Neves, Mariana [2]

Wiegandt, David Luis [1]

Leser, Ulf [1]

Affiliations:

[1] Humboldt Univ, Dept Comp Sci, D-10099 Berlin, Germany

[2] Hasso Plattner Inst, Enterprise Platform & Integrat Concepts, D-14482 Potsdam, Germany

Source:

BIOINFORMATICS | 2017 / Vol. 33 / Issue 14

Keywords:

DOI:

10.1093/bioinformatics/btx228

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification codes:

071010; 081704

Abstract:

Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.

Pages: I37-I48

Page count: 12
