Analysis of Similarity Measures with WordNet Based Text Document Clustering

Cited by: 0
Authors
Sandhya, Nadella [1 ]
Govardhan, A. [2 ]
Affiliations
[1] Gokaraju Rangaraju Inst Engn & Technol, CSE Dept, Hyderabad 500072, Andhra Pradesh, India
[2] JNTUH Coll Engn, Hyderabad 505501, Andhra Pradesh, India
Source
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS DESIGN AND INTELLIGENT APPLICATIONS 2012 (INDIA 2012) | 2012 / Vol. 132
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text document clustering helps reorganize large document collections into a smaller number of manageable clusters. Although several clustering methods and associated similarity measures have been proposed, partitional clustering algorithms are reported to perform well on document clustering. The cosine function is usually used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. Word meanings represent the topics of documents better than word forms do, so we incorporate an ontology into the text clustering algorithm. In this research, a WordNet-based document representation is attempted by assigning each word a part-of-speech (POS) tag and by enriching the bag-of-words representation with the synset concept, which corresponds to the synonym set introduced by WordNet. After the words in the bag of words are replaced with their respective synset IDs, a variant of the K-Means algorithm is used for document clustering. We then compare three popular similarity measures (cosine, Pearson correlation coefficient, and extended Jaccard) in conjunction with different vector space representations of documents (term frequency and term frequency-inverse document frequency).
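The three similarity measures and two weighting schemes named in the abstract can be illustrated with a minimal sketch. It uses the common textbook definitions, which may differ in detail from the formulation in the paper; the function names, the logarithmic IDF variant, and the toy synset-style tokens (such as "car.n.01") are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Build term-frequency vectors over a shared, sorted vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    return vocab, [[float(c[t]) for t in vocab] for c in counts]

def tfidf_vectors(docs):
    """Weight TF by log(N / df); this IDF variant is an assumption."""
    vocab, tf = tf_vectors(docs)
    n = len(docs)
    df = [sum(1 for vec in tf if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n / d) for d in df]
    return vocab, [[vec[j] * idf[j] for j in range(len(vocab))] for vec in tf]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def pearson(a, b):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

def extended_jaccard(a, b):
    """Extended (Tanimoto) Jaccard: a.b / (|a|^2 + |b|^2 - a.b)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0

# Toy documents whose words are already replaced by hypothetical synset IDs.
docs = [
    ["car.n.01", "engine.n.01", "car.n.01"],
    ["car.n.01", "engine.n.01", "road.n.01"],
    ["bank.n.01", "money.n.01"],
]

for name, build in (("TF", tf_vectors), ("TF-IDF", tfidf_vectors)):
    _, vectors = build(docs)
    for measure in (cosine, pearson, extended_jaccard):
        score = measure(vectors[0], vectors[1])
        print(f"{name:6s} {measure.__name__:17s} doc0 vs doc1: {score:.4f}")
```

In a K-Means-style algorithm such as the one the abstract mentions, a measure like these would typically be used to assign each document to its most similar centroid; that clustering step is not shown here.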
Pages: 703 / +
Page count: 2
Related papers
50 items in total
[31]   Validation of text clustering based on document contents [J].
Toivonen, J ;
Visa, A ;
Vesanen, T ;
Back, B ;
Vanharanta, H .
MACHINE LEARNING AND DATA MINING IN PATTERN RECOGNITION, 2001, 2123 :184-195
[32]   Document Similarity Measures and Document Browsing [J].
Ahmadullin, Ildus ;
Fan, Jian ;
Damera-Venkata, Niranjan ;
Lim, Suk Hwan ;
Lin, Qian ;
Liu, Jerry ;
Liu, Sam ;
O'Brien-Strain, Eamonn ;
Allebach, Jan .
IMAGING AND PRINTING IN A WEB 2.0 WORLD II, 2011, 7879
[33]   Text-Based Measures of Document Diversity [J].
Bache, Kevin ;
Newman, David ;
Smyth, Padhraic .
19TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'13), 2013, :23-31
[34]   An integration of fuzzy association rules and WordNet for document clustering [J].
Chen, Chun-Ling ;
Tseng, Frank S. C. ;
Liang, Tyne .
KNOWLEDGE AND INFORMATION SYSTEMS, 2011, 28 (03) :687-708
[35]   An integration of fuzzy association rules and WordNet for document clustering [J].
Chun-Ling Chen ;
Frank S. C. Tseng ;
Tyne Liang .
Knowledge and Information Systems, 2011, 28 :687-708
[36]   An Integration of Fuzzy Association Rules and WordNet for Document Clustering [J].
Chen, Chun-Ling ;
Tseng, Frank S. C. ;
Liang, Tyne .
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2009, 5476 :147-+
[37]   A Similarity Function for Feature Pattern Clustering and High Dimensional Text Document Classification [J].
Vinay Kumar Kotte ;
Srinivasan Rajavelu ;
Elijah Blessing Rajsingh .
Foundations of Science, 2020, 25 :1077-1094
[38]   A Similarity Function for Feature Pattern Clustering and High Dimensional Text Document Classification [J].
Kotte, Vinay Kumar ;
Rajavelu, Srinivasan ;
Rajsingh, Elijah Blessing .
FOUNDATIONS OF SCIENCE, 2020, 25 (04) :1077-1094
[39]   Semantic similarity measures for formal concept analysis using linked data and WordNet [J].
Jiang, Yuncheng ;
Yang, Mingxuan ;
Qu, Rong .
MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (14) :19807-19837
[40]   Semantic similarity measures for formal concept analysis using linked data and WordNet [J].
Yuncheng Jiang ;
Mingxuan Yang ;
Rong Qu .
Multimedia Tools and Applications, 2019, 78 :19807-19837