Indexing and querying segmented web pages: the BlockWeb Model

Cited by: 3
Authors
Bruno, Emmanuel [1]
Faessel, Nicolas [2]
Glotin, Herve [1]
Le Maitre, Jacques [1]
Scholl, Michel [3]
Affiliations
[1] Univ Sud Toulon Var, LSIS, F-83957 La Garde, France
[2] Univ Paul Cezanne, LSIS, F-13397 Marseille 20, France
[3] CNAM, F-75141 Paris 03, France
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2011, Vol. 14, Issue 5-6
Keywords
web page segmentation; block importance; block permeability; web image indexing; document indexing; document retrieval
DOI
10.1007/s11280-011-0124-6
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We present a model for indexing and querying web pages based on the hierarchical decomposition of pages into blocks. Splitting a page into blocks has several advantages for page design, indexing, and querying: (i) the blocks of a page most similar to a query can be returned instead of the whole page; (ii) the importance of a block can be taken into account; and (iii) so can the permeability of a block to its neighboring blocks: a block b is said to be permeable to a block b' in the same page if the content of b' (text, image, etc.) can be (partially) inherited by b upon indexing. We describe an engine implementing this model, including the transformation of web pages into block hierarchies, the definition of a dedicated language for expressing indexing rules, and the storage of indexed blocks in an XML repository. The model is assessed on a dataset of electronic news and on a dataset drawn from web pages of the ImagEval campaign, where it improves the mean average precision of the baseline by 16%.
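The notions of importance and permeability described in the abstract can be illustrated with a small sketch. It is not the authors' engine or their rule language: the `Block` class, the additive weighting, the term-count index, and the example coefficients are all assumptions made here for illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical page block: own text, an importance weight, and
    permeability coefficients toward other blocks of the same page."""
    name: str
    text: str = ""
    importance: float = 1.0
    # permeability[other] = fraction of other's term weights inherited (assumed form)
    permeability: dict = field(default_factory=dict)


def local_index(block):
    """Term counts of the block's own text."""
    return Counter(block.text.lower().split())


def index_block(block, blocks_by_name):
    """Own term weights plus the partially inherited weights of the
    blocks this block is permeable to (assumed additive combination)."""
    weights = {term: float(count) for term, count in local_index(block).items()}
    for other_name, alpha in block.permeability.items():
        for term, count in local_index(blocks_by_name[other_name]).items():
            weights[term] = weights.get(term, 0.0) + alpha * count
    return weights


def score(query, block, blocks_by_name):
    """Importance-weighted sum of the query terms' weights in the block index."""
    weights = index_block(block, blocks_by_name)
    return block.importance * sum(weights.get(t, 0.0) for t in query.lower().split())


# An image block with no text of its own inherits half of its caption's terms.
caption = Block("caption", "sunset over toulon harbour")
image = Block("image", "", permeability={"caption": 0.5})
blocks = {b.name: b for b in (caption, image)}

image_index = index_block(image, blocks)
image_score = score("toulon sunset", image, blocks)
caption_score = score("toulon sunset", caption, blocks)
```

In this toy setup the image block becomes retrievable by the query even though it carries no text itself, which is the effect the abstract attributes to permeability; ranking blocks by `score` then returns individual blocks rather than the whole page.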
Pages: 623-649
Page count: 27