A Novel Word Clustering and Cluster Merging Technique for Named Entity Recognition

Cited by: 2
Authors
Patra, Rakesh [1]
Saha, Sujan Kumar [1]
Affiliations
[1] Birla Inst Technol Mesra, Dept Comp Sci & Engn, Ranchi, Bihar, India
Keywords
Word clustering; Brown clustering; hierarchical clustering; cluster merging; named entity recognition
DOI
10.1515/jisys-2016-0074
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we present a novel word clustering technique to capture contextual similarity among words. Related word clustering techniques in the literature rely on word statistics collected from a small, fixed word window. For example, the Brown clustering algorithm is based on bigram statistics of the words. However, in sequential labeling tasks such as named entity recognition (NER), words from a longer context also carry valuable information. To capture this longer context, we propose a new word clustering algorithm that uses parse information of the sentences and a non-fixed word window. The proposed algorithm, named variable window clustering, performs better than Brown clustering in our experiments. Additionally, to use two different clustering techniques simultaneously in a classifier, we propose a cluster merging technique that performs an output-level merging of two sets of clusters. To test the effectiveness of the approaches, we use two different NER data sets, namely, Hindi and BioCreative II Gene Mention Recognition. A baseline NER system is developed using a conditional random fields classifier, and the clusters from the individual techniques, as well as from the merged technique, are then incorporated to improve the classifier. Experimental results demonstrate that the cluster merging technique is quite promising.
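The abstract describes feeding cluster information from two clusterings into a CRF-based tagger but does not spell out the merging procedure. As a rough illustration only, the Python sketch below assumes each clustering is available as a plain word-to-cluster-ID dictionary and builds per-token features from both. The names `brown_clusters`, `vw_clusters`, and `token_features`, the toy cluster IDs, and the "pair the two IDs" merge are all hypothetical; they are not the authors' method.

```python
def token_features(tokens, i, brown_clusters, vw_clusters):
    """Build a feature dict for tokens[i] from two hypothetical word-to-cluster maps."""
    word = tokens[i].lower()
    # Words missing from a clustering fall back to a shared "UNK" cluster.
    brown_id = brown_clusters.get(word, "UNK")
    vw_id = vw_clusters.get(word, "UNK")
    return {
        "word": word,
        "brown": brown_id,
        "vw": vw_id,
        # Assumed merge: two words share a merged ID only if both clusterings agree.
        "merged": f"{brown_id}|{vw_id}",
    }


# Toy cluster assignments, invented purely for illustration.
brown_clusters = {"delhi": "0110", "mumbai": "0110", "visited": "1011"}
vw_clusters = {"delhi": "c7", "mumbai": "c7", "visited": "c2"}

sentence = ["She", "visited", "Delhi"]
features = [
    token_features(sentence, i, brown_clusters, vw_clusters)
    for i in range(len(sentence))
]
print(features[2])
```

Feature dictionaries of this shape are a common input format for CRF toolkits. Under this assumed merge, two words share a merged ID only when both clusterings already place them together.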
Pages: 15-30
Page count: 16