A New Word Clustering Algorithm Based on Word Similarity

Cited by: 0
Authors
YUAN Lichi [1]
Affiliations
[1] School of Information Technology, Jiangxi University of Finance and Economics
Funding
National Natural Science Foundation of China;
Keywords
Word similarity; Word clustering; Statistical language model;
DOI
Not available
Chinese Library Classification (CLC) number
TP391.1 [Text information processing];
Discipline classification codes
081203; 0835;
Abstract
Category-based statistical language models are an important way to address the data sparsity problem in statistical language modeling. However, they face two bottlenecks: 1) word clustering, since it is hard to find a clustering method that performs well without a large amount of computation; and 2) class-based methods always lose some predictive ability when adapting to text from different domains. To address these problems, a novel definition of word similarity based on mutual information is presented. Based on word similarity, a definition of word set similarity is given, and a bottom-up hierarchical clustering algorithm is proposed. Experimental results show that the word clustering algorithm based on word similarity outperforms the conventional greedy clustering method in both speed and performance; perplexity is reduced from 283 to 207.8.
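The abstract does not state the paper's actual definitions, so the following is only a minimal sketch of the general idea: a word similarity derived from mutual information, a word set similarity built on it, and bottom-up hierarchical clustering. Every concrete choice here is an assumption made for illustration and not the author's method: `pmi_vectors` uses positive pointwise mutual information with the right-hand neighbour, `word_similarity` is a cosine over those vectors, `set_similarity` is the average pairwise similarity, and `cluster` is a naive merge loop that is far too slow for a real vocabulary.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_vectors(tokens):
    """Map each word to a sparse vector of positive PMI with its right neighbour.

    Assumption: this context definition is illustrative, not the paper's.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    vectors = {w: {} for w in unigrams}
    for (w, nxt), count in bigrams.items():
        p_joint = count / n_bi
        p_indep = (unigrams[w] / n_uni) * (unigrams[nxt] / n_uni)
        pmi = math.log(p_joint / p_indep)
        if pmi > 0:  # keep only positive associations
            vectors[w][nxt] = pmi
    return vectors


def word_similarity(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(x * v[k] for k, x in u.items() if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm = norm_u * norm_v
    return dot / norm if norm else 0.0


def set_similarity(a, b, vectors):
    """Average pairwise word similarity between two disjoint word sets."""
    total = sum(word_similarity(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster(tokens, k):
    """Bottom-up clustering: start from singletons, merge the most similar pair until k remain."""
    vectors = pmi_vectors(tokens)
    clusters = [frozenset([w]) for w in sorted(vectors)]
    while len(clusters) > k:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: set_similarity(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


if __name__ == "__main__":
    corpus = "the cat sat . the dog sat . a cat ran . a dog ran .".split()
    for group in cluster(corpus, 4):
        print(sorted(group))
```

On this toy corpus, with four target clusters, the sketch groups `cat` with `dog`, `the` with `a`, and `sat` with `ran`, because each pair shares the same right-hand contexts. Recomputing all pairwise set similarities on every merge keeps the code short; a practical implementation would cache them.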
Pages: 1221-1226
Page count: 6