A New Word Clustering Algorithm Based on Word Similarity

Cited by: 0
Author
YUAN Lichi
Affiliation
[1] School of Information Technology, Jiangxi University of Finance and Economics
Keywords
Word similarity; Word clustering; Statistical language model
DOI
Not available
Chinese Library Classification number
TP391.1 [Text information processing]
Discipline classification code
081203; 0835
Abstract
Category-based statistical language models are an important way to address the data sparsity problem in statistical language modeling. However, this approach has two bottlenecks: 1) word clustering: it is hard to find a clustering method that performs well without a large amount of computation; 2) class-based methods always lose some predictive ability when adapting to text from different domains. To address these problems, a novel definition of word similarity based on mutual information is presented. Based on word similarity, a definition of word set similarity is given and a bottom-up hierarchical clustering algorithm is proposed. Experimental results show that the word clustering algorithm based on word similarity outperforms the conventional greedy clustering method in both speed and performance; the perplexity is reduced from 283 to 207.8.
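The abstract does not state the paper's actual definitions, so the following is only a minimal sketch of the general idea under stated assumptions: word similarity is taken to be the cosine of positive pointwise-mutual-information vectors over right-hand neighbours, word set similarity is taken to be the average pairwise word similarity, and clusters are merged bottom-up until `k` remain. The function names (`pmi_vectors`, `word_similarity`, `set_similarity`, `cluster`) and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_vectors(tokens):
    """Map each word to {right neighbour: positive PMI}, estimated from bigram counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    vectors = {w: {} for w in unigrams}
    for (w1, w2), count in bigrams.items():
        p_joint = count / n_bi
        p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        pmi = math.log(p_joint / p_indep)
        if pmi > 0:  # keep only positive associations
            vectors[w1][w2] = pmi
    return vectors


def word_similarity(v1, v2):
    """Assumed definition: cosine similarity of two sparse PMI vectors."""
    dot = sum(x * v2[key] for key, x in v1.items() if key in v2)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def set_similarity(a, b, vectors):
    """Assumed definition: average pairwise word similarity between two word sets."""
    total = sum(word_similarity(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster(tokens, k):
    """Bottom-up clustering: start from singletons, merge the most similar pair until k remain."""
    vectors = pmi_vectors(tokens)
    clusters = [frozenset([w]) for w in sorted(vectors)]
    while len(clusters) > k:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: set_similarity(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


if __name__ == "__main__":
    corpus = "the cat sat . the dog sat . a cat ran . a dog ran .".split()
    for group in cluster(corpus, 4):
        print(sorted(group))
```

This naive version recomputes every pairwise set similarity at each merge, so it is only suitable for small vocabularies; it illustrates the structure of a similarity-driven hierarchical clustering, not the efficiency claims made in the abstract.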
Pages: 1221-1226
Number of pages: 6