Detecting tables in web documents

Cited by: 14

Authors:

Kim, YS [1]

Lee, KH [1]

Affiliations:

[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea

Source:

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE | 2005 / Vol. 18 / Issue 06

Keywords:

table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction

DOI:

10.1016/j.engappai.2005.01.009

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

TABLE tags in HTML (Hypertext Markup Language) documents are widely used both for laying out Web documents and for describing genuine tables that carry relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, some genuine and non-genuine tables are filtered out using a set of rules devised from a careful examination of the general characteristics of various HTML tables. The remaining tables are detected in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1,393 HTML documents show that the method performs better than previous work, with a precision of 97.54% and a recall of 99.22%. (c) 2005 Elsevier Ltd. All rights reserved.
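
Illustrative note (not part of the original record): the abstract does not list the paper's actual rules or coherency measures, so the following Python sketch only shows the general shape of a two-step check under loudly stated assumptions. The rule in `looks_like_layout`, the type test in `cell_type`, the choice to treat the first row as the attribute area, and every function and class name here are hypothetical stand-ins, not the authors' method.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # completed tables
        self._stack = []   # currently open tables, innermost last

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append({"rows": [], "in_cell": False})
        elif self._stack:
            top = self._stack[-1]
            if tag == "tr":
                top["rows"].append([])
            elif tag in ("td", "th"):
                if not top["rows"]:          # tolerate a missing <tr>
                    top["rows"].append([])
                top["rows"][-1].append("")
                top["in_cell"] = True

    def handle_data(self, data):
        if self._stack and self._stack[-1]["in_cell"]:
            self._stack[-1]["rows"][-1][-1] += data

    def handle_endtag(self, tag):
        if not self._stack:
            return
        if tag in ("td", "th"):
            self._stack[-1]["in_cell"] = False
        elif tag == "table":
            table = self._stack.pop()
            # Drop rows that ended up with no cells at all.
            self.tables.append([row for row in table["rows"] if row])


def cell_type(text):
    """Hypothetical coarse type of a cell: 'empty', 'number', or 'text'."""
    text = text.strip()
    if not text:
        return "empty"
    try:
        float(text.replace(",", "").rstrip("%"))
        return "number"
    except ValueError:
        return "text"


def looks_like_layout(rows):
    """Hypothetical preprocessing rule: too small to hold attribute-value data."""
    return len(rows) < 2 or max(len(row) for row in rows) < 2


def columns_coherent(rows):
    """Assume row 0 is the attribute area; check each value column has one type."""
    body = rows[1:]
    if not body:
        return False
    width = min(len(row) for row in body)
    for col in range(width):
        types = {cell_type(row[col]) for row in body} - {"empty"}
        if len(types) > 1:
            return False
    return True


def detect_tables(html):
    """Return one boolean per <table>: True if it passes both toy checks."""
    collector = TableCollector()
    collector.feed(html)
    return [not looks_like_layout(t) and columns_coherent(t)
            for t in collector.tables]
```

Calling `detect_tables` on a page returns one flag per TABLE tag in the order the tags are closed, so a nested table is reported before the table that contains it. A real detector along the lines of the abstract would also need a semantic check between the attribute and value areas, which this sketch does not attempt.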

Pages: 745-757

Number of pages: 13

