Long distance bigram models applied to word clustering

Cited by: 17

Authors:

Bassiou, Nikoletta [1]

Kotropoulos, Constantine [1]

Affiliation:

[1] Aristotle Univ Thessaloniki, Dept Informat, Thessaloniki 54124, Greece

Source:

PATTERN RECOGNITION | 2011 / Vol. 44 / Issue 1

Keywords:

Word clustering; Language modeling; Distance bigrams; Probabilistic latent semantic analysis; Relative cluster validity indices; Trigger-pairs; Spectral clustering; Cluster dispersion; Cluster sense precision; Cluster sense recall; WordNet;

DOI:

10.1016/j.patcog.2010.07.006

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Two novel word clustering techniques are proposed which employ long distance bigram language models. The first technique is built on a hierarchical clustering algorithm and minimizes the sum of the Mahalanobis distances between all words in a merged cluster and the centroid of the class created by the merger. The second technique resorts to probabilistic latent semantic analysis (PLSA). Next, interpolated long distance bigrams are considered in the context of the aforementioned clustering techniques. Experiments conducted on the English Gigaword corpus (second edition) demonstrate that: (1) the long distance bigrams, when employed in the two clustering techniques under study, yield word clusters of better quality than the baseline bigrams; (2) the interpolated long distance bigrams outperform the long distance bigrams in the same respect; (3) the long distance bigrams perform better than the bigrams that incorporate trigger-pairs selected at various distances; and (4) the best word clustering is achieved by the PLSA that employs interpolated long distance bigrams. Both proposed techniques outperform spectral clustering based on k-means. To assess the quality of the created clusters objectively, relative cluster validity indices are estimated, and the average cluster sense precision, the average cluster sense recall, and the F-measure are computed using ground truth extracted from WordNet. © 2010 Elsevier Ltd. All rights reserved.
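As a rough illustration of the "long distance bigram" idea named in the abstract, the sketch below counts ordered word pairs separated by a fixed distance and turns the counts into maximum-likelihood conditional probabilities. It assumes the common definition in which a distance-d bigram pairs the word at position i - d with the word at position i; the function names, the toy sentence, and the unsmoothed estimate are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def distance_bigram_counts(tokens, distance):
    """Count ordered pairs (tokens[i - distance], tokens[i]).

    distance=1 gives ordinary bigrams; larger values skip the words between.
    (Hypothetical helper, assumed definition of a distance bigram.)
    """
    if distance < 1:
        raise ValueError("distance must be >= 1")
    return Counter(zip(tokens, tokens[distance:]))


def distance_bigram_prob(counts, history, word):
    """Unsmoothed maximum-likelihood estimate of P(word | history) at that distance."""
    total = sum(c for (h, _), c in counts.items() if h == history)
    return counts[(history, word)] / total if total else 0.0


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    counts_d2 = distance_bigram_counts(tokens, 2)
    print(distance_bigram_prob(counts_d2, "the", "sat"))
```

Each row of such conditional probabilities could serve as a feature vector for a word in a clustering step; how the paper builds, interpolates, and smooths these estimates is not stated in the abstract.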

Pages: 145-158

Number of pages: 14
