Analysis of Similarity Measures with WordNet Based Text Document Clustering

Authors
Sandhya, Nadella [1]
Govardhan, A. [2]
Affiliations
[1] Gokaraju Rangaraju Inst Engn & Technol, CSE Dept, Hyderabad 500072, Andhra Pradesh, India
[2] JNTUH Coll Engn, Hyderabad 505501, Andhra Pradesh, India
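Source
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS DESIGN AND INTELLIGENT APPLICATIONS 2012 (INDIA 2012) | 2012 / Vol. 132
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract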
Text document clustering helps reorganize large collections of documents into a smaller number of manageable clusters. While several clustering methods and associated similarity measures have been proposed in the past, partitional clustering algorithms are reported to perform well on document clustering. The cosine function is usually used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. Word meanings represent the topics of documents better than word forms do. We therefore incorporate an ontology into the text clustering algorithm. In this research, a WordNet-based document representation is attempted by assigning each word a part-of-speech (POS) tag and by enriching the 'bag of words' data representation with the synset concept, which corresponds to the synonym set introduced by WordNet. After replacing the words in the 'bag of words' with their respective synset IDs, a variant of the K-Means algorithm is used for document clustering. We then compare three popular similarity measures (cosine, Pearson correlation coefficient, and extended Jaccard) in conjunction with different types of vector space representation (term frequency and term frequency-inverse document frequency) of documents.
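The three similarity measures and two vector space representations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the natural-log IDF weighting, and the toy synset-ID strings are assumptions made here for illustration, and the standard textbook definitions of each measure are used.

```python
import math

def tf_vectors(docs):
    """Term-frequency vectors over a shared, sorted vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    return vocab, [[d.count(t) for t in vocab] for d in docs]

def tfidf_vectors(docs):
    """TF-IDF vectors; idf = ln(N / df) is one common variant (an assumption)."""
    vocab, tf = tf_vectors(docs)
    n = len(docs)
    idf = [math.log(n / sum(1 for d in docs if t in d)) for t in vocab]
    return vocab, [[f * w for f, w in zip(row, idf)] for row in tf]

def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def extended_jaccard(a, b):
    """Extended (Tanimoto) Jaccard: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

def pearson(a, b):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va and vb else 0.0

# Hypothetical documents already mapped to WordNet-style synset IDs.
docs = [
    ["dog.n.01", "bark.v.04", "dog.n.01"],
    ["dog.n.01", "cat.n.01"],
    ["stock.n.01", "market.n.01"],
]
_, tf = tf_vectors(docs)
_, tfidf = tfidf_vectors(docs)
for name, vecs in (("TF", tf), ("TF-IDF", tfidf)):
    print(name, cosine(vecs[0], vecs[1]),
          pearson(vecs[0], vecs[1]),
          extended_jaccard(vecs[0], vecs[1]))
```

Any of the three functions could serve as the similarity in the assignment step of a K-Means variant; which one performs best under each representation is the question the paper investigates, and this sketch makes no claim about the outcome.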
Pages: 703 / +
Page count: 2