Detecting tables in web documents

Cited by: 14
Authors
Kim, YS [1]
Lee, KH [1]
Affiliations
[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea
Keywords
table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction
DOI
10.1016/j.engappai.2005.01.009
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
TABLE tags in HTML (Hypertext Markup Language) documents are widely used both to format the layout of Web documents and to describe genuine tables that carry relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, some of the genuine and non-genuine tables are filtered out using a set of rules devised from a careful examination of the general characteristics of various HTML tables. The remaining tables are classified in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. The method then looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1,393 HTML documents show that the method performs better than previous work, with a precision of 97.54% and a recall of 99.22%. © 2005 Elsevier Ltd. All rights reserved.
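The two-phase idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not list the actual rules, so the preprocessing rules (reject tables that contain nested tables or have fewer than two rows or columns), the coarse cell types, the 0.8 threshold, and all function names below are assumptions made for illustration. The sketch also assumes the first row is the attribute area, expects reasonably well-formed markup, and omits the semantic coherency check entirely.

```python
from html.parser import HTMLParser
import re


class TableCollector(HTMLParser):
    """Collect the cell text of every TABLE element, row by row."""

    def __init__(self):
        super().__init__()
        self.tables = []  # finished tables, inner tables first
        self._stack = []  # tables currently open (handles nesting)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._stack:
                self._stack[-1]["nested"] = True  # parent contains a table
            self._stack.append({"rows": [], "nested": False, "cell": None})
        elif not self._stack:
            return
        elif tag == "tr":
            self._stack[-1]["rows"].append([])
        elif tag in ("td", "th") and self._stack[-1]["rows"]:
            self._stack[-1]["cell"] = []

    def handle_data(self, data):
        if self._stack and self._stack[-1]["cell"] is not None:
            self._stack[-1]["cell"].append(data)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        top = self._stack[-1]
        if tag in ("td", "th") and top["cell"] is not None:
            # Normalize whitespace inside the finished cell.
            top["rows"][-1].append(" ".join("".join(top["cell"]).split()))
            top["cell"] = None
        elif tag == "table":
            self.tables.append(self._stack.pop())


def cell_type(text):
    """Map a cell to a coarse syntactic type (hypothetical categories)."""
    if not text:
        return "empty"
    if re.fullmatch(r"[-+]?[\d,]*\.?\d+%?", text):
        return "number"
    return "text"


def passes_preprocessing(table):
    """Phase 1 (assumed rules): discard tables that look layout-only."""
    rows = [r for r in table["rows"] if r]
    if table["nested"] or len(rows) < 2:
        return False
    return max(len(r) for r in rows) >= 2


def has_syntactic_coherency(table, threshold=0.8):
    """Phase 2 (simplified): each value column is dominated by one cell type."""
    value_rows = [r for r in table["rows"] if r][1:]  # skip attribute row
    if not value_rows:
        return False
    width = max(len(r) for r in value_rows)
    for col in range(width):
        types = [cell_type(r[col]) for r in value_rows if col < len(r)]
        dominant = max(set(types), key=types.count)
        if types.count(dominant) / len(types) < threshold:
            return False
    return True


def detect_genuine_tables(html):
    """Return the rows of every table that survives both phases."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return [t["rows"] for t in parser.tables
            if passes_preprocessing(t) and has_syntactic_coherency(t)]
```

Calling `detect_genuine_tables` on a page that holds a layout table wrapping a navigation table, plus a small two-column data table, would return only the data table under these assumed rules. A fuller version would need the semantic check between attribute and value areas that the abstract describes.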
Pages: 745-757
Number of pages: 13