On the Strength of Character Language Models for Multilingual Named Entity Recognition

Cited by: 0
Authors
Yu, Xiaodong [1]
Mayhew, Stephen [2]
Sammons, Mark [1]
Roth, Dan [2]
Affiliations
[1] Univ Illinois, Champaign, IL 61820 USA
[2] Univ Penn, Philadelphia, PA 19104 USA
Source
2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018) | 2018
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Character-level patterns have been widely used as features in English Named Entity Recognition (NER) systems. However, to date there has been no direct investigation of the inherent differences between name and non-name tokens in text, nor whether this property holds across multiple languages. This paper analyzes the capabilities of corpus-agnostic Character-level Language Models (CLMs) in the binary task of distinguishing name tokens from non-name tokens. We demonstrate that CLMs provide a simple and powerful model for capturing these differences, identifying named entity tokens in a diverse set of languages at close to the performance of full NER systems. Moreover, by adding very simple CLM-based features we can significantly improve the performance of an off-the-shelf NER system for multiple languages.
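The binary name/non-name task described in the abstract can be illustrated with a minimal sketch. It assumes an add-k smoothed character trigram model and a lowest-perplexity decision rule; the abstract does not specify the paper's model types, smoothing, or training data, so the class `CharNgramLM`, the function `is_name`, and the toy word lists below are hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter


class CharNgramLM:
    """Add-k smoothed character n-gram language model (illustrative only)."""

    def __init__(self, n=3, k=1.0):
        self.n = n                  # n-gram order (assumed: trigram)
        self.k = k                  # add-k smoothing constant (assumed)
        self.ngrams = Counter()     # counts of (context..., char)
        self.contexts = Counter()   # counts of (context...)
        self.vocab = set()

    def _pad(self, token):
        # Pad with start symbols and one end symbol so every character,
        # including the first and the end-of-token event, has a full context.
        return ["<s>"] * (self.n - 1) + list(token) + ["</s>"]

    def fit(self, tokens):
        for token in tokens:
            chars = self._pad(token)
            self.vocab.update(chars)
            for i in range(self.n - 1, len(chars)):
                context = tuple(chars[i - self.n + 1:i])
                self.ngrams[context + (chars[i],)] += 1
                self.contexts[context] += 1
        return self

    def perplexity(self, token):
        chars = self._pad(token)
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        log_prob = 0.0
        steps = 0
        for i in range(self.n - 1, len(chars)):
            context = tuple(chars[i - self.n + 1:i])
            num = self.ngrams[context + (chars[i],)] + self.k
            den = self.contexts[context] + self.k * v
            log_prob += math.log(num / den)
            steps += 1
        return math.exp(-log_prob / steps)


def is_name(token, name_lm, other_lm):
    # Label a token as a name when the name-trained model finds it
    # less surprising (lower perplexity) than the non-name model.
    return name_lm.perplexity(token) < other_lm.perplexity(token)


# Toy training lists, invented for illustration only.
names = ["London", "Paris", "Berlin", "Madrid", "Lisbon", "Dublin"]
others = ["the", "running", "quickly", "and", "walked", "under"]

name_lm = CharNgramLM().fit(names)
other_lm = CharNgramLM().fit(others)

print(is_name("London", name_lm, other_lm))
print(is_name("the", name_lm, other_lm))
```

In a setup like this, the two perplexity scores could also be passed to an existing NER tagger as extra per-token features, which is one plausible reading of the "very simple CLM-based features" the abstract mentions.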
Pages: 3073-3077
Page count: 5