A comparison of several statistical word clustering methods

Cited by: 1

Authors:

Yuan L. [1]

Affiliations:

[1] Jiangxi Key Laboratory of Data and Knowledge Engineering, School of Information Technology, Jiangxi University of Finance and Economics, Nanchang

Source:

Yuan, Lichi (yuanlichi@sohu.com) | 2016 / Central South University of Technology / Issue 47

Funding:

National Natural Science Foundation of China

Keywords:

Mutual information; Natural language processing; Word clustering; Word similarity

DOI:

10.11817/j.issn.1672-7207.2016.09.023

Abstract:

Data sparsity is a main factor limiting the performance of statistical language models, and class-based statistical language models are an effective way to address it. A definition of word similarity based on the mutual information of adjacent words is proposed, and a definition of word-set similarity is derived from it; a bottom-up hierarchical word clustering algorithm that can reach a global optimum is then put forward. The results show that the word clustering algorithm runs fast and clusters well. Interpolating the class-based models with the word-based models mitigates the remaining sparse-data problems of statistical language models. © 2016, Central South University Press. All rights reserved.
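The abstract does not give the formulas, so the following is only a minimal sketch of the general idea, under stated assumptions: word similarity is taken to be the cosine between vectors of positive pointwise mutual information with left and right neighbours, word-set similarity is taken to be the average pairwise word similarity, and merging is greedy. These choices and all function names are illustrative assumptions, not the paper's definitions, and the greedy loop does not by itself guarantee a global optimum.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_vectors(tokens):
    """Map each word to a sparse vector of positive PMI with its neighbours."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    vectors = {w: {} for w in unigrams}
    for (left, right), count in bigrams.items():
        p_pair = count / n_bi
        p_indep = (unigrams[left] / n_uni) * (unigrams[right] / n_uni)
        pmi = math.log(p_pair / p_indep)
        if pmi > 0:  # keep only positive associations
            vectors[left][("R", right)] = pmi   # right neighbour of `left`
            vectors[right][("L", left)] = pmi   # left neighbour of `right`
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v[k] for k, x in u.items() if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
        math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def set_similarity(a, b, vectors):
    """Assumed word-set similarity: average pairwise word similarity."""
    total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster(tokens, num_classes):
    """Bottom-up clustering: repeatedly merge the two most similar word sets."""
    vectors = pmi_vectors(tokens)
    clusters = [frozenset([w]) for w in sorted(vectors)]
    while len(clusters) > num_classes:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: set_similarity(pair[0], pair[1], vectors))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters
```

Calling `cluster(tokens, k)` on a tokenized corpus returns `k` frozensets of words; words that occur in the same neighbouring contexts tend to be merged first. Each merge step compares every pair of current clusters, so this sketch is only practical for small vocabularies.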

References

Pages: 3079-3084

Page count: 5

15 entries in total

[1] Chen L., Huang T., A novel word clustering algorithm and vari-gram language model, Chinese Journal of Computers, 22, 9, pp. 942-947, (1999)
[2] Sun J., Zhu J., Xu X., A new algorithm of Chinese words automatic clustering, Journal of Shanghai Jiaotong University, 37, pp. 139-142, (2003)
[3] Matsuzaki T., Miyao Y., An efficient clustering algorithm for class-based language models, Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL, pp. 119-126, (2003)
[4] Liu Y., Wang N., Zhang T., Spectral clustering for Chinese word, Proceedings of the Sixth International Conference on Fuzzy Systems and Knowledge Discovery, pp. 529-533, (2009)
[5] Yuan L., Word clustering based on similarity and vari-gram language model, Journal of Chinese Computer Systems, 30, 5, pp. 912-915, (2009)
[6] Yuan L., Dependency language parsing model based on word clustering, Journal of Central South University (Science and Technology), 42, 7, pp. 2023-2027, (2011)
[7] Liu S., Li S., Zhao T., et al., Directly smooth interpolation algorithm in head-driven parsing, Journal of Software, 20, 11, pp. 2915-2924, (2009)
[8] Wu W., Zhou J., Qu W., A survey of syntactic parsing based on statistical learning, Journal of Chinese Information Processing, 27, 3, pp. 9-19, (2013)
[9] Dai Y., Wu C., Ma S., et al., Hierarchically classified probabilistic grammar parsing, Journal of Software, 22, 2, pp. 245-257, (2011)
[10] Jurafsky D., Martin J.H., Speech and Language Processing, pp. 210-265, (2009)
