Detecting tables in web documents

Cited by: 14
Authors
Kim, YS [1]
Lee, KH [1]
Affiliations
[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea
Keywords
table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction
DOI
10.1016/j.engappai.2005.01.009
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The TABLE tags in HTML (Hypertext Markup Language) documents are widely used for formatting the layout of Web documents as well as for describing genuine tables with relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, some genuine and non-genuine tables are filtered out using a set of rules, which are devised based on careful examination of the general characteristics of various HTML tables. The remaining tables are classified in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1,393 HTML documents show that the method performs better than previous work, with a precision of 97.54% and a recall of 99.22%. © 2005 Elsevier Ltd. All rights reserved.
Pages: 745-757
Number of pages: 13