QProber: A system for automatic classification of hidden-Web databases

Cited by: 44
Authors
Gravano, L
Ipeirotis, PG
Sahami, M
Affiliations
[1] Columbia Univ, Dept Comp Sci, New York, NY 10027 USA
[2] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Keywords
algorithms; experimentation; performance; database classification; Web databases; hidden Web
DOI
10.1145/635484.635485
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The contents of many valuable Web-accessible databases are only available through search interfaces and are hence invisible to traditional Web "crawlers." Recently, commercial Web sites have started to manually organize Web-accessible databases into Yahoo!-like hierarchical classification schemes. Here we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred Web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
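The core idea in the abstract, classifying a database from match counts alone, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the probe sets in `PROBES`, the `min_share` threshold, the toy in-memory collection, and the decision rule (assign every category whose probes account for at least `min_share` of all matches). The toy `toy_count_matches` function stands in for a remote search interface that reports only how many documents match a query.

```python
from typing import Callable, Dict, List, Tuple

# A probe is a conjunction of terms, e.g. ("heart", "disease").
Probe = Tuple[str, ...]

def classify_by_match_counts(
    probes: Dict[str, List[Probe]],
    count_matches: Callable[[Probe], int],
    min_share: float = 0.5,
) -> List[str]:
    """Return categories whose probes draw at least `min_share` of all matches.

    Only the number of matches per probe is used; no documents are fetched.
    The threshold rule is a hypothetical simplification.
    """
    totals = {
        category: sum(count_matches(probe) for probe in probe_list)
        for category, probe_list in probes.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no probe matched anything, so no evidence for any category
    return sorted(c for c, n in totals.items() if n / grand_total >= min_share)

# Hypothetical probes, standing in for rules extracted from a document classifier.
PROBES: Dict[str, List[Probe]] = {
    "Health": [("cancer",), ("heart", "disease")],
    "Sports": [("basketball",), ("soccer",)],
}

# Toy collection that simulates a hidden-Web database behind a search form.
TOY_DOCS = [
    "heart disease treatment",
    "cancer treatment trial",
    "nba basketball score",
    "cancer research",
]

def toy_count_matches(probe: Probe) -> int:
    """Count documents containing every term of the probe (boolean AND)."""
    return sum(all(term in doc.split() for term in probe) for doc in TOY_DOCS)

if __name__ == "__main__":
    print(classify_by_match_counts(PROBES, toy_count_matches))
```

In this toy setup the Health probes account for three of the four matches, so only Health clears the 0.5 threshold. A real deployment would replace `toy_count_matches` with a call that submits the query to the database's search interface and parses the reported number of hits.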
Pages: 1-41
Page count: 41