Detecting tables in web documents

Cited by: 14
Authors
Kim, YS [1]
Lee, KH [1]
Affiliations
[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea
Keywords
table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction
DOI
10.1016/j.engappai.2005.01.009
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
TABLE tags in HTML (Hypertext Markup Language) documents are widely used both for laying out Web documents and for describing genuine tables that carry relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, a portion of the genuine and non-genuine tables is filtered out using a set of rules devised from careful examination of the general characteristics of various HTML tables. The remaining tables are classified in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1,393 HTML documents show that the method performs better than previous work, with a precision of 97.54% and a recall of 99.22%. © 2005 Elsevier Ltd. All rights reserved.
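The two-phase idea in the abstract can be illustrated with a minimal sketch. The abstract does not list the paper's actual rules or coherency tests, so everything specific here is an assumption made for illustration: the rule "fewer than two rows or two columns means layout table", the treatment of the first row as the attribute area, the coarse number/text typing, and the names `TableCollector`, `coarse_type`, `value_area_is_coherent`, and `detect_genuine_tables`. The semantic coherency check is not sketched.

```python
import re
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the cell text of every TABLE element as a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each a list of rows of cell strings
        self._stack = []   # tables currently open
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # A nested table is collected separately; in this sketch the
            # enclosing cell is dropped from its parent table.
            self._stack.append([])
            self._cell = None
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack:
            if not self._stack[-1]:
                self._stack[-1].append([])  # cell without an explicit <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._stack:
            text = " ".join("".join(self._cell).split())
            self._stack[-1][-1].append(text)
            self._cell = None
        elif tag == "table" and self._stack:
            self.tables.append(self._stack.pop())

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def coarse_type(cell):
    """Map a cell to a coarse syntactic class (assumed, illustrative scheme)."""
    if not cell:
        return "empty"
    if re.fullmatch(r"[-+]?[\d,]*\.?\d+%?", cell):
        return "number"
    return "text"


def value_area_is_coherent(rows):
    """Assumed check: every value column holds a single coarse type."""
    header, values = rows[0], rows[1:]
    if any(len(row) != len(header) for row in values):
        return False
    for column in zip(*values):
        kinds = {coarse_type(cell) for cell in column} - {"empty"}
        if len(kinds) > 1:
            return False
    return True


def detect_genuine_tables(html):
    """Return the tables that pass both illustrative phases."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    genuine = []
    for rows in parser.tables:
        rows = [row for row in rows if row]
        # Phase 1 (assumed rule): too small to hold attribute-value relations.
        if len(rows) < 2 or max(len(row) for row in rows) < 2:
            continue
        # Phase 2 (assumed test): syntactic coherency of the value area.
        if value_area_is_coherent(rows):
            genuine.append(rows)
    return genuine
```

Under these assumptions, a single-cell table used for page layout is discarded in the first phase, while a table with a header row and type-consistent columns survives the second.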
Pages: 745-757
Page count: 13