Block-based similarity search on the Web using manifold-ranking

Cited by: 0
Authors
Wan, Xiaojun [1]
Yang, Jianwu [1]
Xiao, Jianguo [1]
Affiliations
[1] Peking Univ, Inst Comp Sci & Technol, Beijing 100871, Peoples R China
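Source
WEB INFORMATION SYSTEMS - WISE 2006, PROCEEDINGS | 2006 / Vol. 4255
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract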
Similarity search on the web aims to find web pages similar to a query page and return them as a ranked list. The popular approach is to calculate the pairwise similarity between web pages using the Cosine measure and then rank the web pages by their similarity to the query page. In this paper, we propose a novel similarity search approach that re-ranks the initially retrieved web pages using manifold-ranking of page blocks. First, web pages are segmented into semantic blocks with the VIPS algorithm. Second, the blocks receive ranking scores from the manifold-ranking algorithm. Finally, web pages are re-ranked according to overall retrieval scores obtained by fusing the ranking scores of their blocks. The proposed approach evaluates web page similarity at the finer granularity of page blocks rather than at the traditional coarse granularity of whole web pages. Moreover, it exploits the intrinsic global manifold structure of the blocks to rank them more appropriately. Experimental results on the ODP data demonstrate that the proposed approach significantly outperforms the popular Cosine measure. The semantic block is also shown to be a better unit than the whole web page in the manifold-ranking process.
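The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the code uses the commonly cited manifold-ranking iteration f <- alpha * S * f + (1 - alpha) * y over a cosine affinity graph. Block segmentation (VIPS in the paper) is not reproduced; blocks are assumed to arrive as term-weight dictionaries. The function names, the value of `alpha`, and the max-based fusion of block scores into a page score are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def manifold_rank(blocks, query_idx, alpha=0.5, iters=100):
    """Iterate f <- alpha * S f + (1 - alpha) * y on the block graph.

    alpha and iters are illustrative defaults, not values from the paper.
    """
    n = len(blocks)
    # Affinity matrix W with a zero diagonal (no self-loops).
    W = [[cosine(blocks[i], blocks[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    # Symmetric normalisation S = D^(-1/2) W D^(-1/2).
    d = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) if d[i] and d[j] else 0.0
          for j in range(n)] for i in range(n)]
    # y marks the query blocks; scores spread from them along the graph.
    y = [1.0 if i in query_idx else 0.0 for i in range(n)]
    f = y[:]
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f

def fuse_page_scores(block_scores, block_page, query_idx):
    """Fuse block scores per page by taking the maximum (an assumed rule)."""
    pages = {}
    for i, s in enumerate(block_scores):
        if i in query_idx:
            continue  # query blocks do not belong to a candidate page
        p = block_page[i]
        pages[p] = max(pages.get(p, 0.0), s)
    return pages

# Toy data: block 0 is the query; blocks 1-2 belong to page "A", block 3 to "B".
blocks = [
    {"manifold": 1.0, "ranking": 1.0},
    {"manifold": 1.0, "ranking": 1.0, "graph": 1.0},
    {"cooking": 1.0, "recipe": 1.0},
    {"cooking": 1.0, "pasta": 1.0},
]
block_page = [None, "A", "A", "B"]
scores = fuse_page_scores(manifold_rank(blocks, {0}), block_page, {0})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy example, page "A" ranks above page "B" because one of its blocks shares terms with the query block, even though its other block is off-topic. That is the intuition behind scoring at block rather than page granularity.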
Pages: 60-71
Page count: 12