Building a Corpus-Derived Gazetteer for Named Entity Recognition

Cited by: 0
Authors
Zamin, Norshuhani [1]
Oxley, Alan [1]
Affiliations
[1] Univ Teknol PETRONAS, Dept Comp & Informat Sci, Tronoh 31750, Perak, Malaysia
Source
SOFTWARE ENGINEERING AND COMPUTER SYSTEMS, PT 2 | 2011 / Vol. 180
Keywords
Gazetteer; Named Entity Recognition; Natural Language Processing; Terrorism
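DOI
Not available
Chinese Library Classification number
TP [Automation Technology; Computer Technology]
Discipline classification code
0812
Abstract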
Gazetteers, or entity dictionaries, are an important element of Named Entity Recognition, which is itself an essential component of Information Extraction. Gazetteers work as specialized dictionaries to support initial tagging. They provide quick entity identification, thus creating a richer document representation. However, the compilation of such gazetteers is sometimes cited as a stumbling block in Named Entity Recognition. Machine learning, rule-based, and look-up-based approaches are often used to perform this process. In this paper, a gazetteer developed from MUC-3 annotated data for the 'person name' entity type is presented. The process used has a small computational cost. We combine rule-based grammars and a simple filtering technique to induce the gazetteer automatically. We conclude with experiments that compare the content of the induced gazetteer with that of a manually crafted one.
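The abstract does not give the grammar rules or the filter, so the following is only a minimal sketch of the general idea it describes: a rule extracts candidate person names, a simple filter discards unlikely ones, and the result is compared with a manually built list. The title cues, the filter list, the function names, and the sample sentences are all hypothetical and are not taken from the paper or from MUC-3.

```python
import re
from collections import Counter

# Hypothetical title cues; the paper's actual grammar rules are not given here.
TITLES = r"(?:President|General|Colonel|Minister|Mr\.|Mrs\.|Dr\.)"
# Rule: a title cue followed by one to three capitalized tokens.
NAME_RULE = re.compile(TITLES + r"\s+((?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+){0,2})")

# Hypothetical filter list: capitalized words that are not person names.
NON_NAMES = {"The", "Government", "Army", "Police"}


def induce_gazetteer(documents, min_count=1):
    """Collect rule matches, then drop rare or filtered candidates."""
    counts = Counter()
    for text in documents:
        for match in NAME_RULE.finditer(text):
            counts[match.group(1)] += 1
    return {
        name
        for name, n in counts.items()
        if n >= min_count and not (set(name.split()) & NON_NAMES)
    }


def compare(induced, manual):
    """Return (precision, recall) of the induced set against a manual one."""
    overlap = induced & manual
    precision = len(overlap) / len(induced) if induced else 0.0
    recall = len(overlap) / len(manual) if manual else 0.0
    return precision, recall


# Toy sentences with invented names, for illustration only.
docs = [
    "President Juan Perez condemned the attack.",
    "General Police units were deployed, said Colonel Maria Lopez.",
    "Colonel Maria Lopez met President Juan Perez.",
]
gazetteer = induce_gazetteer(docs)
manual = {"Juan Perez", "Maria Lopez", "Carlos Ruiz"}
precision, recall = compare(gazetteer, manual)
```

In this toy run the candidate "Police" is matched by the rule but removed by the filter, which is the kind of cheap correction a filtering step can provide. Precision and recall are used here as one plausible way to compare two gazetteers; the paper may use a different measure.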
Pages: 73-80
Page count: 8