Extracting Visually Presented Element Relationships from Web Documents

Cited by: 0
Authors
Burget, Radek [1]
Smrz, Pavel [1]
Affiliations
[1] Brno Univ Technol, Fac Informat Technol, IT4Innovat Ctr Excellence, Brno, Czech Republic
Funding
European Union Seventh Framework Programme (FP7);
Keywords
Document Analysis; Element Relationships; Logical Document Structure; Page Segmentation; Web Documents;
DOI
10.4018/ijcini.2013040102
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many documents on the World Wide Web present structured information that consists of multiple pieces of data with certain relationships among them. Although it is usually not difficult to identify the individual data values in the document text, their relationships are often not explicitly described in the document content. Instead, they are expressed by the visual presentation of the content, which a human reader is expected to interpret. In this paper, the authors propose a formal generic model of logical relationships in a document based on an interpretation of visual presentation patterns in the documents. The model describes the visually expressed relationships between individual parts of the content independently of the document format and the particular way of presentation. Therefore, it can be used as a document model in many information retrieval or extraction applications. The authors formally define the model, introduce a method of extracting the relationships between the content parts based on visual presentation analysis, and discuss the expected applications. They also present a new dataset consisting of programmes of conferences and other scientific events and discuss its suitability for the task at hand. Finally, they use the dataset to evaluate the results of the implemented system.
Pages: 13-29
Number of pages: 17
Related papers
8 items in total
  • [1] An Expansion Method of XML Element Retrieval Techniques into Web Documents
    Keyaki, Atsushi
    Miyazaki, Jun
    Hatano, Kenji
    2014 IIAI 3RD INTERNATIONAL CONFERENCE ON ADVANCED APPLIED INFORMATICS (IIAI-AAI 2014), 2014, : 853 - 858
  • [2] Extracting halftones from printed documents using texture analysis
    Dunn, DF
    Weldon, TP
    Higgins, WE
    OPTICAL ENGINEERING, 1997, 36 (04) : 1044 - 1052
  • [3] A Survey on Region Extractors from Web Documents
    Sleiman, Hassan A.
    Corchuelo, Rafael
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013, 25 (09) : 1960 - 1981
  • [4] Recognition techniques for extracting information from semi-structured documents
    Della Ventura, A
    Gagliardi, I
    Zonta, B
    DOCUMENT RECOGNITION AND RETRIEVAL VIII, 2001, 4307 : 130 - 137
  • [5] Leveraging Generative Vision Models for Extracting Process Models from Documents
    Voelter, Marvin
    Hadian, Raheleh
    Kampik, Timotheus
    Breitmayer, Marius
    Reichert, Manfred
    BUSINESS PROCESS MANAGEMENT WORKSHOPS, BPM 2024, 2025, 534 : 271 - 282
  • [6] Extracting color halftones from printed documents using texture analysis
    Dunn, DF
    Mathew, NE
    PATTERN RECOGNITION, 2000, 33 (03) : 445 - 463
  • [7] SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)
    Beel, Joeran
    Gipp, Bela
    Shaker, Ammar
    Friedrich, Nick
    RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES, 2010, 6273 : 413 - 416
  • [8] Theme Extraction from Chinese Web Documents Based on Page Segmentation and Entropy
    Wang, Deqing
    Zhang, Hui
    Zhou, Gang
    FOUNDATIONS OF INTELLIGENT SYSTEMS, PROCEEDINGS, 2009, 5722 : 221 - 230