Analysis of Similarity Measures with WordNet Based Text Document Clustering

Cited by: 0
Authors
Sandhya, Nadella [1 ]
Govardhan, A. [2 ]
Affiliations
[1] Gokaraju Rangaraju Inst Engn & Technol, CSE Dept, Hyderabad 500072, Andhra Pradesh, India
[2] JNTUH Coll Engn, Hyderabad 505501, Andhra Pradesh, India
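Source
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS DESIGN AND INTELLIGENT APPLICATIONS 2012 (INDIA 2012) | 2012 / Vol. 132
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code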
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text document clustering reorganizes large document collections into a smaller number of manageable clusters. Although several clustering methods and associated similarity measures have been proposed, partitional clustering algorithms are reported to perform well on document clustering. The cosine function is usually used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. Word meanings represent the topics of documents better than word forms do, so we incorporate an ontology into the text clustering algorithm. In this research, a WordNet-based document representation is built by assigning each word a part-of-speech (POS) tag and by enriching the bag-of-words representation with synset concepts, which correspond to the synonym sets introduced by WordNet. After the words in the bag of words are replaced with their respective synset IDs, a variant of the K-Means algorithm is used for document clustering. We then compare three popular similarity measures (cosine, Pearson correlation coefficient, and extended Jaccard) in conjunction with two vector space representations of documents (term frequency and term frequency-inverse document frequency).
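The three similarity measures and two vector representations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy documents, and the concept labels standing in for WordNet synset IDs are all assumptions made for illustration, and the extended Jaccard and TF-IDF forms shown are the commonly used textbook definitions.

```python
import math
from collections import Counter


def tf_vector(tokens, vocab):
    """Term-frequency vector of one document over a fixed vocabulary."""
    counts = Counter(tokens)
    return [float(counts[t]) for t in vocab]


def tfidf_vectors(docs, vocab):
    """TF-IDF vectors, using idf = log(N / df) (one common variant)."""
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    idf = {t: math.log(n / df[t]) for t in vocab}
    return [[tf * idf[t] for tf, t in zip(tf_vector(d, vocab), vocab)]
            for d in docs]


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def extended_jaccard(a, b):
    """Extended (Tanimoto) Jaccard: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def pearson(a, b):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0


# Hypothetical documents whose words were already replaced by concept labels
# (stand-ins for synset IDs; not taken from the paper).
docs = [
    ["car.n.01", "engine.n.01", "wheel.n.01"],
    ["car.n.01", "engine.n.01", "road.n.01"],
    ["bank.n.02", "loan.n.01", "interest.n.04"],
]
vocab = sorted({t for d in docs for t in d})
tf = [tf_vector(d, vocab) for d in docs]
tfidf = tfidf_vectors(docs, vocab)

for name, measure in [("cosine", cosine), ("pearson", pearson),
                      ("extended_jaccard", extended_jaccard)]:
    print(name, "TF:", round(measure(tf[0], tf[1]), 3),
          "TF-IDF:", round(measure(tfidf[0], tfidf[1]), 3))
```

Mapping synonyms to a shared concept label before vectorizing is what lets two documents that use different word forms overlap in the vector space; a K-Means variant would then use one of these measures in its criterion function.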
Pages: 703 / +
Number of pages: 2