A text similarity measurement combining word semantic information with TF-IDF method

Cited by: 73
Authors
Huang C.-H. [1, 2]
Yin J. [1]
Hou F. [2]
Affiliations
[1] School of Information Science and Technology, Sun Yat-sen University
[2] Department of Computer Science and Technology, Guangdong University of Finance
Source
Keywords
Natural language processing; Term semantic similarity; Text clustering; Text similarity
DOI
10.3724/SP.J.1016.2011.00856
CLC number
Subject classification number
Abstract
Traditional text similarity measures use the TF-IDF method to model text documents as term-frequency vectors and compute the similarity between documents with cosine similarity. These methods ignore the semantic information in the documents, while methods enhanced with semantic information discriminate poorly between documents because extending the vectors with semantically similar terms aggravates the curse of dimensionality. This paper proposes a similarity measure that builds on the TF-IDF method and analyzes the similarity between the important terms of text documents. The approach uses NLP techniques to preprocess the text and uses TF-IDF to select key terms, that is, terms whose TF-IDF values are higher than those of common terms. Using the proposed data structure TSWT (Term Similarity Weight Tree) and a definition of semantic similarity, the paper exploits the semantic information of these key terms to compute the similarity between text documents. Finally, several K-Means clustering methods are used to evaluate the performance of the new similarity measure. Compared with TF-IDF and with a state-of-the-art similarity method based on semantic information, experimental results on a benchmark corpus show that the proposed measure improves the F-measure.
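The baseline the abstract describes (TF-IDF vectors compared by cosine similarity) and the key-term filtering step can be illustrated with a minimal Python sketch. It is not the paper's method. TSWT and the paper's semantic similarity definition are not specified in the abstract and are not reproduced here. The function names `tf_idf_vectors`, `cosine`, and `key_terms`, the cutoff `k`, the toy documents, and the particular TF-IDF weighting (relative term frequency times log(N/df)) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one sparse {term: weight} vector per tokenized document.

    Assumed weighting: (count / doc length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def key_terms(vector, k):
    """Keep the k terms with the highest TF-IDF weight (hypothetical cutoff)."""
    ranked = sorted(vector.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


# Toy corpus, already tokenized (illustrative data only).
docs = [
    ["stock", "market", "price", "rise"],
    ["stock", "market", "price", "fall"],
    ["football", "match", "score", "goal"],
]
vectors = tf_idf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # documents sharing three terms
sim_02 = cosine(vectors[0], vectors[2])  # documents sharing no terms
top_0 = key_terms(vectors[0], 1)         # highest-weighted term of document 0
```

In this sketch "rise" and "fall" contribute nothing to the similarity of the first two documents because cosine similarity over term vectors only rewards exact term overlap. That is the limitation the abstract attributes to plain TF-IDF and the reason the paper compares key terms semantically instead.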
Pages: 856-864
Page count: 8