Utilizing the Structure and Content Information for XML Document Clustering

Cited by: 0
Authors
Tran, Tien [1]
Kutty, Sangeetha [1]
Nayak, Richi [1]
Affiliations
[1] Queensland Univ Technol, Fac Sci & Technol, Brisbane, Qld 4001, Australia
Source
Advances in Focused Retrieval | 2009 / Vol. 5631
Keywords
Wikipedia; clustering; LSK; INEX 2008
DOI
Not available
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
This paper reports on the experiments and results of a clustering approach used in the INEX 2008 document mining challenge. The approach uses both the structure and the content of the Wikipedia XML document collection. A latent semantic kernel (LSK) measures the semantic similarity between XML documents based on their content features. Constructing a latent semantic kernel requires computing a singular value decomposition (SVD), which is very expensive in time and memory on a large feature-space matrix. In this approach, the document space of the term-document matrix is therefore reduced before the SVD is performed. The reduction is based on the common structural information of the Wikipedia XML document collection. The proposed clustering approach has been shown to be effective on the Wikipedia collection in the INEX 2008 document mining challenge.
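The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a small dense term-document matrix, a kernel built from the top-k left singular vectors, and cosine similarity in the projected space. The names `build_lsk`, `lsk_similarity`, and `sample_cols` are hypothetical, and the hard-coded column subset only stands in for the structure-based document space reduction, whose details the abstract does not give.

```python
import numpy as np

def build_lsk(term_doc, k):
    """Return the top-k left singular vectors of a term-document matrix.

    term_doc has shape (n_terms, n_docs). Running the SVD on fewer
    document columns is what keeps the decomposition affordable.
    """
    U, _s, _Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k]

def lsk_similarity(U_k, d1, d2):
    """Cosine similarity of two term vectors after projection onto U_k."""
    p1 = U_k.T @ d1
    p2 = U_k.T @ d2
    denom = np.linalg.norm(p1) * np.linalg.norm(p2)
    return float(p1 @ p2 / denom) if denom > 0 else 0.0

# Toy term-document matrix: 5 terms (rows) x 4 documents (columns).
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])

# Assumption: these columns were picked by some structure-based
# reduction step; here they are simply hard-coded for illustration.
sample_cols = [0, 2]
U_k = build_lsk(A[:, sample_cols], k=2)

# Every document, sampled or not, can be compared through the kernel.
sim_related = lsk_similarity(U_k, A[:, 0], A[:, 1])
sim_unrelated = lsk_similarity(U_k, A[:, 0], A[:, 2])
print(sim_related, sim_unrelated)
```

In this toy setting the kernel is built from two of the four documents only, yet it still scores the remaining documents, which is the point of reducing the document space before the SVD.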
Pages: 460-468
Page count: 9