Organizing hidden-Web databases by clustering visible Web documents

Cited by: 0
Authors
Barbosa, Luciano [1]
Freire, Juliana [1]
Silva, Altigran [2]
Affiliations
[1] Univ Utah, Salt Lake City, UT 84112 USA
[2] Univ Fed Amazonas, Manaus, Amazonas, Brazil
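Funding
National Science Foundation (USA);
Keywords
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract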
In this paper we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context, both within and in the neighborhood of forms, as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters, measured both in terms of entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.
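The abstract names entropy and F-measure as the cluster-quality measures but does not give their formulas. The following is a minimal sketch using definitions that are common in the document-clustering literature (size-weighted cluster entropy, and a class-size-weighted best-match F-measure); these definitions, the function names, and the toy labels are assumptions for illustration and may differ from what the paper actually uses.

```python
import math
from collections import Counter


def cluster_entropy(clusters):
    """Size-weighted average entropy of gold labels per cluster (lower is better)."""
    total = sum(len(c) for c in clusters)
    weighted = 0.0
    for labels in clusters:
        if not labels:
            continue  # an empty cluster contributes nothing
        n = len(labels)
        counts = Counter(labels)
        h = -sum((k / n) * math.log2(k / n) for k in counts.values())
        weighted += (n / total) * h
    return weighted


def overall_f_measure(clusters):
    """For each gold class, take the best F1 over all clusters,
    then average those values weighted by class size (higher is better)."""
    total = sum(len(c) for c in clusters)
    class_sizes = Counter(label for c in clusters for label in c)
    score = 0.0
    for label, size in class_sizes.items():
        best = 0.0
        for c in clusters:
            hits = c.count(label)
            if hits == 0:
                continue
            precision = hits / len(c)
            recall = hits / size
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (size / total) * best
    return score


# Hypothetical toy data: each inner list holds the gold database domain
# of every form assigned to one cluster.
clusters = [
    ["airfare", "airfare", "airfare", "hotel"],
    ["hotel", "hotel", "hotel"],
    ["job", "job", "airfare"],
]
print("entropy:", cluster_entropy(clusters))
print("F-measure:", overall_f_measure(clusters))
```

Under these definitions, a clustering that puts every domain in its own cluster scores an entropy of 0 and an F-measure of 1, so low entropy together with high F-measure corresponds to the "high-quality clusters" the abstract reports.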
Pages: 301 / +
Page count: 2