Link contexts in classifier-guided topical crawlers

Cited by: 57
Authors
Pant, G
Srinivasan, P
Affiliations
[1] Univ Utah, Sch Accounting & Informat Syst, Salt Lake City, UT 84112 USA
[2] Univ Iowa, Sch Lib & Informat Sci, Iowa City, IA 52242 USA
Keywords
Web search; Web mining; performance evaluation
DOI
10.1109/TKDE.2006.12
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The context of a hyperlink, or link context, is defined as the terms that appear in the text around a hyperlink within a Web page. Link contexts have been applied to a variety of Web information retrieval and categorization tasks. Topical or focused Web crawlers have a special reliance on link contexts. These crawlers automatically navigate the hyperlinked structure of the Web while using link contexts to predict the benefit of following the corresponding hyperlinks with respect to some initiating topic or theme. Using topical crawlers that are guided by a Support Vector Machine, we investigate the effects of various definitions of link contexts on the crawling performance. We find that a crawler that exploits words both in the immediate vicinity of a hyperlink and in the entire parent page performs significantly better than a crawler that depends on just one of those cues. Also, we find that a crawler that uses the tag tree hierarchy within Web pages provides effective coverage. We analyze our results along various dimensions such as link context quality, topic difficulty, length of crawl, training data, and topic domain. The study was done using multiple crawls over 100 topics covering millions of pages, allowing us to derive statistically strong results.
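To make the two cues named in the abstract concrete, the following is a minimal sketch of extracting, for each hyperlink, a window of words in its immediate vicinity together with the words of the entire parent page. It is an illustration only, not the authors' implementation: the names `LinkContextParser` and `link_contexts`, the tokenization rule, and the window size of 3 are all assumptions, and the sketch covers neither the tag tree variant nor the Support Vector Machine that scores the contexts.

```python
from html.parser import HTMLParser
import re


class LinkContextParser(HTMLParser):
    """Hypothetical helper: records page tokens and each link's token span."""

    def __init__(self):
        super().__init__()
        self.tokens = []   # all text tokens of the page, in document order
        self.links = []    # (href, start_index, end_index) for each hyperlink
        self._open = None  # (href, start_index) of the anchor currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.tokens))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.tokens)))
            self._open = None

    def handle_data(self, data):
        # Assumed tokenization: lowercase alphanumeric runs.
        self.tokens.extend(re.findall(r"[a-z0-9]+", data.lower()))


def link_contexts(html, window=3):
    """Map each href to its nearby words and the full parent-page words."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    contexts = {}
    for href, start, end in parser.links:
        lo = max(0, start - window)                # words before the anchor text
        hi = min(len(parser.tokens), end + window) # words after the anchor text
        contexts[href] = {
            "window": parser.tokens[lo:hi],  # immediate-vicinity cue
            "page": parser.tokens,           # entire-parent-page cue
        }
    return contexts
```

A crawler in the spirit of the abstract would turn both token lists into features and let a trained classifier estimate the benefit of following the link; how those features are weighted and combined is left open here.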
Pages: 107-122
Number of pages: 16