A comparison of several statistical word clustering methods

Cited by: 1
Authors
Yuan L. [1]
Affiliation
[1] Jiangxi Key Laboratory of Data and Knowledge Engineering, School of Information Technology, Jiangxi University of Finance and Economics, Nanchang
Source
Yuan, Lichi (yuanlichi@sohu.com) | 2016 / Central South University of Technology / Issue 47
Funding
National Natural Science Foundation of China
Keywords
Mutual information; Natural language processing; Word clustering; Word similarity;
DOI
10.11817/j.issn.1672-7207.2016.09.023
Abstract
Data sparsity is a main factor limiting the performance of statistical language models, and class-based statistical language models are an effective way to address it. A definition of word similarity is proposed using the mutual information of adjoining words, and a definition of word-set similarity is derived from it; a bottom-up hierarchical word clustering algorithm that can reach a global optimum is then put forward. The results show that the clustering algorithm runs fast and clusters well. Interpolating the class-based models with word-based models mitigates the remaining sparse-data problems of statistical language models. © 2016, Central South University Press. All rights reserved.
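The abstract builds its word similarity on the mutual information of adjoining words but does not give the formula. As a rough illustration only, the sketch below computes the standard pointwise mutual information (PMI) of adjacent word pairs from raw counts. The function name `adjacent_pmi`, the toy corpus, and the use of unsmoothed relative frequencies are assumptions made for this example, not the paper's definition.

```python
import math
from collections import Counter

def adjacent_pmi(tokens):
    """Return PMI (base 2) for every adjacent word pair seen in tokens.

    Hypothetical helper for illustration; probabilities are plain
    relative frequencies with no smoothing.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)          # number of word positions
    n_bi = len(tokens) - 1       # number of adjacent pairs
    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        pmi[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return pmi

# Toy corpus, assumed for the example only.
tokens = "the cat sat on the mat the cat ran".split()
scores = adjacent_pmi(tokens)
```

A clustering method of the kind described could compare words by how similar their PMI values are across shared neighbours, but that step depends on the paper's own definitions and is not reproduced here.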
Pages: 3079-3084
Number of pages: 5