Named entity recognition with multiple segment representations

Cited by: 42
Authors
Cho, Han-Cheol [1]
Okazaki, Naoaki [2]
Miwa, Makoto [3]
Tsujii, Jun'ichi [4]
Affiliations
[1] Univ Tokyo, Dept Comp Sci, Suda Lab, Bunkyo Ku, Tokyo 1138656, Japan
[2] Tohoku Univ, Dept Syst Informat Sci, Inui & Okazaki Lab, Aoba Ku, Sendai, Miyagi 9808579, Japan
[3] Manchester Interdisciplinary Bioctr, Natl Ctr Text Min, Manchester M1 7DN, Lancs, England
[4] Microsoft Res Asia, Beijing 1000080, Peoples R China
Keywords
Named entity recognition; Machine learning; Conditional random fields; Feature engineering; GENE
DOI
10.1016/j.ipm.2013.03.002
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Named entity recognition (NER) is usually formalized as a sequence labeling problem in which named entity segments are represented by label sequences. Although considerable effort has gone into sophisticated features that encode textual characteristics of named entities (e.g., PEOPLE, LOCATION), little attention has been paid to segment representations (SRs) for multi-token named entities (e.g., the IOB2 notation). In this paper, we investigate the effects of different SRs on NER tasks and propose a feature generation method that uses multiple SRs. The proposed method allows a model to exploit both the highly discriminative features of complex SRs and the features of simple SRs, which are robust against the data sparseness problem. Because it incorporates different SRs as feature functions of Conditional Random Fields (CRFs), the well-established training procedure can be used. In addition, the tagging speed of a model integrating multiple SRs can be accelerated to match that of a model using only the most complex SR in the integrated model. Experimental results demonstrate that incorporating multiple SRs into a single model improves the performance and stability of NER. We also provide a detailed analysis of the results. (C) 2013 Elsevier Ltd. All rights reserved.
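
To make the notion of a segment representation concrete, the following minimal sketch labels the same sentence under three schemes: IO (simple), IOB2 (named in the abstract), and IOBES (a commonly used, more complex scheme). It is an illustration only, not the paper's feature generation method. The abstract does not list which SRs the paper evaluates, so the choice of IO and IOBES is an assumption, and the function names, example sentence, and entity type labels are hypothetical.

```python
from typing import List, Tuple

# A span is (start index, end index exclusive, entity type).
# This span format is an assumption made for the illustration.
Span = Tuple[int, int, str]


def to_io(n_tokens: int, spans: List[Span]) -> List[str]:
    """IO: every token inside an entity gets I-TYPE; boundaries are not marked."""
    labels = ["O"] * n_tokens
    for start, end, etype in spans:
        for i in range(start, end):
            labels[i] = f"I-{etype}"
    return labels


def to_iob2(n_tokens: int, spans: List[Span]) -> List[str]:
    """IOB2: the first token of an entity gets B-TYPE, the rest get I-TYPE."""
    labels = ["O"] * n_tokens
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels


def to_iobes(n_tokens: int, spans: List[Span]) -> List[str]:
    """IOBES: marks Begin, Inside, End, and Single-token entities separately."""
    labels = ["O"] * n_tokens
    for start, end, etype in spans:
        if end - start == 1:
            labels[start] = f"S-{etype}"
        else:
            labels[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                labels[i] = f"I-{etype}"
            labels[end - 1] = f"E-{etype}"
    return labels


# Hypothetical example sentence with one two-token and one single-token entity.
tokens = ["Barack", "Obama", "visited", "Paris"]
spans = [(0, 2, "PERSON"), (3, 4, "LOCATION")]

for name, fn in [("IO", to_io), ("IOB2", to_iob2), ("IOBES", to_iobes)]:
    print(f"{name:6s}", fn(len(tokens), spans))
```

The richer the scheme, the more distinct labels each entity type needs (one for IO, two for IOB2, four for IOBES), which is the trade-off the abstract describes between discriminative power and robustness to sparse data.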
Pages: 954-965
Page count: 12