SPHINX: a framework for creating personal, site-specific Web crawlers

Cited by: 29
Authors
Miller, RC
Bharat, K
Affiliations
[1] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15208 USA
[2] Digital Equipment Corp, Syst Res Ctr, Palo Alto, CA 94301 USA
Source
COMPUTER NETWORKS AND ISDN SYSTEMS | 1998 / Vol. 30 / Issue 1-7
Keywords
crawlers; robots; spiders; Web automation; Web searching; Java; end-user programming; mobile code
DOI
10.1016/S0169-7552(98)00064-6
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Crawlers, also called robots and spiders, are programs that browse the World Wide Web autonomously. This paper describes SPHINX, a Java toolkit and interactive development environment for Web crawlers. Unlike other crawler development systems, SPHINX is geared towards developing crawlers that are Web-site-specific, personally customized, and relocatable. SPHINX allows site-specific crawling rules to be encapsulated and reused in content analyzers, known as classifiers. Personal crawling tasks can be performed (often without programming) in the Crawler Workbench, an interactive environment for crawler development and testing. For efficiency, relocatable crawlers developed using SPHINX can be uploaded and executed on a remote Web server. (C) 1998 Published by Elsevier Science B.V. All rights reserved.
Pages: 119-130
Page count: 12