Organizing hidden-Web databases by clustering visible Web documents

Cited by: 0

Authors:

Barbosa, Luciano [1]

Freire, Juliana [1]

Silva, Altigran [2]

Affiliations:

[1] Univ Utah, Salt Lake City, UT 84112 USA

[2] Univ Fed Amazonas, Manaus, Amazonas, Brazil

Source:

2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3 | 2007

Funding:

U.S. National Science Foundation;

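Keywords:

DOI:

Not available

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract: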

In this paper we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context, both within and in the neighborhood of forms, as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters, measured both in terms of entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.
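The abstract reports cluster quality in terms of entropy and F-measure but does not spell out the formulas. The sketch below uses the definitions commonly found in the document-clustering literature (size-weighted per-cluster entropy, and a class-weighted best-match F-measure); it is an assumption that the paper uses these same variants. The function names and the toy domain labels ("air", "job") are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy(labels, clusters):
    """Size-weighted average entropy (base 2) of the true labels in each cluster.

    Assumed definition: lower is better, 0.0 means every cluster is pure.
    """
    members = defaultdict(list)
    for label, cluster in zip(labels, clusters):
        members[cluster].append(label)

    total = len(labels)
    entropy = 0.0
    for group in members.values():
        counts = Counter(group)
        h = -sum((c / len(group)) * math.log2(c / len(group))
                 for c in counts.values())
        entropy += (len(group) / total) * h
    return entropy


def cluster_f_measure(labels, clusters):
    """Class-weighted F-measure: each true class is matched to its best cluster.

    Assumed definition: higher is better, 1.0 means a perfect clustering.
    """
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))

    total = len(labels)
    score = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n = joint.get((cls, clu), 0)
            if n == 0:
                continue
            precision = n / n_clu
            recall = n / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (n_cls / total) * best
    return score


if __name__ == "__main__":
    # Hypothetical gold domains for four forms and one candidate clustering.
    gold = ["air", "air", "job", "job"]
    found = [0, 0, 1, 1]
    print(cluster_entropy(gold, found), cluster_f_measure(gold, found))
```

Under these definitions a clustering that groups forms exactly by domain has zero entropy and an F-measure of one, which is the direction in which "high-quality clusters" should be read.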


Pages: 301 / +

Page count: 2

50 items in total

[11] Hidden schema extraction in web documents
Carchiolo, V
Longheu, A
Malgeri, M
DATABASES IN NETWORKED INFORMATION SYSTEMS, PROCEEDINGS, 2003, 2822 : 42 - 52
[12] Hidden schema extraction in web documents
1600, International Affairs Committee; University of Aizu, (Springer Verlag):
[13] Hidden-Web Privacy Preservation Surfing (Hi-WePPS) model
Elovici, Y
Shapira, B
Spanglet, Y
PRIVACY AND TECHNOLOGIES OF IDENTITY: A CROSS-DISCIPLINARY CONVERSATION, 2006, : 335 - 348
[14] Clustering Deep Web databases semantically
Song, Ling
Ma, Jun
Yan, Po
Lian, Li
Zhang, Dongmei
INFORMATION RETRIEVAL TECHNOLOGY, 2008, 4993 : 365 - 376
[15] Clustering Retrieved Web Documents to Speed Up Web Searches
Qumsiyeh, Rani
Ng, Yiu-Kai
COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2015, PT I, 2015, 9155 : 472 - 488
[16] Web documents clustering with interest links
Cui, ZF
Xu, BW
Zhang, WF
Xu, JL
SOSE 2005: IEEE INTERNATIONAL WORKSHOP ON SERVICE-ORIENTED SYSTEM ENGINEERING, 2005, : 111 - 116
[17] Semantic based clustering of web documents
Lin, TY
Chiang, IJ
2005 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING, VOLS 1 AND 2, 2005, : 189 - 192
[18] Fast fuzzy clustering of Web documents
Wang, Jian-Hui
Jiang, Long-Bin
Yang, Shu
Chang'an Daxue Xuebao (Ziran Kexue Ban)/Journal of Chang'an University (Natural Science Edition), 2007, 27 (02): : 107 - 110
[19] Clustering template based web documents
Gottron, Thomas
ADVANCES IN INFORMATION RETRIEVAL, 2008, 4956 : 40 - 51
[20] Clustering of Short Commercial Documents for the Web
Carullo, Moreno
Binaghi, Elisabetta
Gallo, Ignazio
Lamberti, Nicola
19TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOLS 1-6, 2008, : 1873 - +
