Using word clusters to detect similar web documents

Cited by: 0
Authors
Koberstein, Jonathan [1]
Ng, Yiu-Kai [1]
Affiliation
[1] Brigham Young Univ, Dept Comp Sci, Provo, UT 84602 USA
Source
KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT | 2006, Vol. 4092
Keywords
DOI
Not available
CLC number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
It is relatively easy to detect exact matches in Web documents; however, detecting similar content in distinct Web documents with different words and sentence structures is a much more difficult task. A reliable tool for determining the degree of similarity between any two Web documents could help filter or retain Web documents with similar content. Most methods for detecting similarity between documents rely on some kind of textual fingerprinting or on searching for exactly matched substrings. This may not be sufficient, as changing the sentence structure or replacing words with synonyms can cause sentences with similar or identical content to be treated as different. In this paper, we develop a sentence-based Fuzzy Set Information Retrieval (IR) approach that uses word clusters, which capture the similarity between different words, to discover similar documents. Our approach has the advantage of detecting documents with similar, but not necessarily identical, sentences based on fuzzy-word sets. We consider three fuzzy-word clustering techniques, each of which generates word-to-word correlation values: the correlation cluster, the association cluster, and the metric cluster. Experimental results show that, by adopting the metric cluster, our similarity detection approach achieves a high accuracy rate in detecting similar documents and improves on previous Fuzzy Set IR approaches based solely on the correlation cluster.
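The idea of matching sentences through fuzzy-word sets can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not spell out. It assumes the standard fuzzy set IR membership formula, in which a word belongs to a sentence to degree 1 minus the product of (1 - c) over the sentence's words, where c is a word-to-word correlation value. The names `membership`, `sentence_similarity`, and `toy_corr`, and the correlation values in `TOY_CORRELATIONS`, are hypothetical and chosen only for illustration.

```python
from math import prod

# Hypothetical word-to-word correlation values (not taken from the paper).
TOY_CORRELATIONS = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("fast", "quick")): 0.8,
}


def toy_corr(a, b):
    """Return an assumed correlation in [0, 1]; identical words score 1."""
    if a == b:
        return 1.0
    return TOY_CORRELATIONS.get(frozenset((a, b)), 0.0)


def membership(word, sentence, corr):
    """Degree to which `word` belongs to the fuzzy set of `sentence`.

    Uses 1 - prod(1 - c(word, k)) over every word k in the sentence,
    so a single strongly correlated word is enough for high membership.
    """
    return 1.0 - prod(1.0 - corr(word, other) for other in sentence)


def sentence_similarity(s1, s2, corr):
    """Average membership of the words of s1 in s2 (asymmetric by design)."""
    if not s1:
        return 0.0
    return sum(membership(w, s2, corr) for w in s1) / len(s1)


if __name__ == "__main__":
    a = ["fast", "car"]
    b = ["quick", "automobile"]
    print(sentence_similarity(a, b, toy_corr))
```

In this sketch, two sentences that share no exact words can still score highly when their words are strongly correlated, which is the behavior the abstract attributes to fuzzy-word sets. Swapping `toy_corr` for values derived from a correlation, association, or metric cluster would change only the `corr` argument.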
Pages: 215-228
Page count: 14