On ontology-driven document clustering using core semantic features

Cited by: 62
Authors
Fodeh, Samah [1]
Punch, Bill [2]
Tan, Pang-Ning [2]
Affiliations
[1] Yale Univ, Yale Ctr Med Informat, New Haven, CT 06520 USA
[2] Michigan State Univ, Dept Comp Sci & Engn, E Lansing, MI 48824 USA
Keywords
Clustering; Information gain; Semantic features; Ontology; Dimensionality reduction
DOI
10.1007/s10115-010-0370-4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
Pages: 395-421
Page count: 27