Detecting tables in web documents

Cited by: 14
Authors
Kim, YS [1]
Lee, KH [1]
Affiliation
[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea
Keywords
table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction
DOI
10.1016/j.engappai.2005.01.009
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The TABLE tags in HTML (Hypertext Markup Language) documents are widely used both for formatting the layout of Web documents and for describing genuine tables with relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, some genuine and non-genuine tables are filtered out using a set of rules devised from a careful examination of the general characteristics of various HTML tables. The remaining tables are detected in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1,393 HTML documents show that the method outperforms previous work, achieving a precision of 97.54% and a recall of 99.22%. © 2005 Elsevier Ltd. All rights reserved.
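The two-phase idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the actual preprocessing rules or define coherency precisely, so the size rule in `prefilter`, the coarse cell types, the 0.8 threshold, the treatment of the first row as the attribute area, and all function names are assumptions made for illustration only. The semantic coherency check is omitted.

```python
import re

# Assumed pattern for numeric cells such as "12", "3.5", "9,700,000", "97.54%".
NUMBER = re.compile(r"^[-+]?[\d,]*\.?\d+%?$")


def cell_type(text):
    """Return a coarse syntactic type for a cell: 'empty', 'number', or 'text'."""
    text = text.strip()
    if not text:
        return "empty"
    return "number" if NUMBER.match(text) else "text"


def prefilter(rows):
    """Phase 1 (hypothetical rule): reject tables too small to hold relations.

    Returns False for a rejected table, or None when the rule cannot decide.
    """
    if len(rows) < 2 or max(len(r) for r in rows) < 2:
        return False
    return None


def value_area_is_coherent(rows, threshold=0.8):
    """Phase 2 (simplified): each value column should be dominated by one type.

    Assumes the first row is the attribute area and the rest is the value area.
    """
    values = rows[1:]
    n_cols = min(len(r) for r in values)
    if n_cols == 0:
        return False
    for col in range(n_cols):
        types = [cell_type(r[col]) for r in values]
        most_common = max(set(types), key=types.count)
        if types.count(most_common) / len(types) < threshold:
            return False
    return True


def is_genuine_table(rows):
    """Apply the prefilter first, then the syntactic coherency check."""
    decision = prefilter(rows)
    if decision is not None:
        return decision
    return value_area_is_coherent(rows)


data_table = [["City", "Population"], ["Seoul", "9,700,000"], ["Busan", "3,400,000"]]
layout_table = [["Home | About | Contact"]]
print(is_genuine_table(data_table), is_genuine_table(layout_table))
```

Here `rows` is a list of rows of cell strings already extracted from a TABLE element; parsing the HTML itself is left out to keep the sketch short.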
Pages: 745-757
Number of pages: 13