A topic-specific crawling strategy based on semantics similarity

Cited by: 17
Authors
Du, YaJun [1]
Pen, QiangQiang [1]
Gao, ZhaoQiong [1]
Affiliations
[1] Xihua Univ, Sch Math & Comp Sci, Chengdu 610039, Sichuan, Peoples R China
Keywords
Search engine; Focused crawling; Formal concept analysis; Web crawler; Concept context graph; Web information systems; Information retrieval; FORMAL CONCEPT ANALYSIS; INFORMATION-RETRIEVAL; COMPUTER-NETWORKS; CONCEPT LATTICES; ISDN SYSTEMS; WEB; MODELS; PERFORMANCE;
DOI
10.1016/j.datak.2013.09.003
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
With the Internet growing exponentially, search engines are encountering unprecedented challenges. A focused search engine selectively seeks out web pages that are relevant to user topics. Determining the best strategy for conducting a focused search is an important and active research topic. At present, the rank values of unvisited web pages are computed from hyperlinks (as in the PageRank algorithm), from a Vector Space Model, or from a combination of the two, but not from the semantic relations between the user topic and the unvisited web pages. In this paper, we propose a concept context graph that stores the knowledge context derived from the user's history of clicked web pages and guides a focused crawler in its next crawl. The concept context graph provides a novel semantic ranking that guides the web crawler toward web pages highly relevant to the user's topic. By computing the concept distance and concept similarity among the concepts of the concept context graph, and by matching unvisited web pages against the graph, we compute the rank values of the unvisited web pages and select the relevant hyperlinks. Additionally, we build a focused crawling system and measure the precision, recall, average harvest rate, and F-measure of our proposed approach in comparison with Breadth First, Cosine Similarity, the Link Context Graph and the Relevancy Context Graph. The results show that our proposed method outperforms the other methods. (C) 2013 Elsevier B.V. All rights reserved.
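The two mechanical pieces the abstract mentions, ranking unvisited pages and scoring a crawl, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Frontier` class, the function names, and the sample URLs are hypothetical, and the rank value passed to `push` is only a placeholder for whatever the concept context graph would compute. The metrics use the common textbook definitions, with harvest rate taken as the fraction of crawled pages that are relevant; the paper may define its averaged variant differently.

```python
import heapq


def precision_recall_f(retrieved: set, relevant: set) -> tuple:
    """Standard precision, recall, and F-measure (harmonic mean) over URL sets."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def harvest_rate(crawled: list, relevant: set) -> float:
    """Fraction of crawled pages that are relevant (assumed definition)."""
    if not crawled:
        return 0.0
    return sum(1 for url in crawled if url in relevant) / len(crawled)


class Frontier:
    """Hypothetical best-first crawl frontier: highest rank value is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url: str, rank_value: float) -> None:
        # rank_value is a stand-in for the semantic rank of an unvisited page.
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-rank_value, self._order, url))
        self._order += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    frontier = Frontier()
    for url, score in [("u1", 0.2), ("u2", 0.9), ("u3", 0.5)]:
        frontier.push(url, score)
    crawl_order = [frontier.pop() for _ in range(len(frontier))]
    print(crawl_order)
    print(precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"}))
    print(harvest_rate(["a", "x", "b", "y"], {"a", "b", "e"}))
```

A priority queue is used here because a focused crawler repeatedly needs the single best-ranked unvisited link, which a heap provides in logarithmic time per operation.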
Pages: 75-93
Page count: 19