Detecting tables in web documents

Cited by: 14

Authors:

Kim, YS [1]

Lee, KH [1]

Affiliations:

[1] Yonsei Univ, Dept Comp Sci, Seodaemun Ku, Seoul 120749, South Korea

Source:

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE | 2005 / Vol. 18 / Issue 06

Keywords:

table detection; HTML document; web document analysis; attribute-value relations extraction; information extraction;

DOI:

10.1016/j.engappai.2005.01.009

Chinese Library Classification (CLC):

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

The TABLE tags in HTML (Hypertext Markup Language) documents are widely used for formatting the layout of Web documents as well as for describing genuine tables with relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, some genuine and non-genuine tables are filtered out using a set of rules devised from a careful examination of the general characteristics of various HTML tables. The remaining tables are detected in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1,393 HTML documents show that the method performs better than previous work, with a precision of 97.54% and a recall of 99.22%. (c) 2005 Elsevier Ltd. All rights reserved.
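
The two-phase idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the actual rules or coherency measures, so the filter rule (at least 2 rows and 2 columns), the three-way cell typing, the 0.8 and 0.5 thresholds, the choice of the first row as the attribute area, and all function names below are assumptions made for illustration. The sketch covers only a rough form of syntactic coherency and leaves out the semantic coherency check entirely.

```python
import re


def cell_type(text):
    """Map a cell to a coarse syntactic type (hypothetical three-way scheme)."""
    text = text.strip()
    if not text:
        return "empty"
    if re.fullmatch(r"[-+]?[\d,]*\.?\d+%?", text):
        return "number"
    return "text"


def passes_preprocessing(rows):
    """Illustrative filter rule (assumed): at least 2 rows and 2 columns."""
    return len(rows) >= 2 and min(len(r) for r in rows) >= 2


def column_coherency(rows):
    """Fraction of value-area columns whose cells mostly share one type."""
    value_area = rows[1:]  # assumption: the first row is the attribute area
    n_cols = min(len(r) for r in value_area)
    coherent = 0
    for c in range(n_cols):
        types = [cell_type(r[c]) for r in value_area]
        most_common = max(set(types), key=types.count)
        # A column counts as coherent if 80% of its cells share a type.
        if types.count(most_common) / len(types) >= 0.8:
            coherent += 1
    return coherent / n_cols


def looks_genuine(rows, threshold=0.5):
    """Phase 1 filters by rule; phase 2 checks value-area coherency."""
    if not passes_preprocessing(rows):
        return False
    return column_coherency(rows) >= threshold


data_table = [["Name", "Price"], ["Apple", "1.20"], ["Pear", "0.95"], ["Plum", "2.10"]]
layout_table = [["Logo", "Search"], ["Welcome to our site", "42"], ["", "Click here"], ["3", ""]]
print(looks_genuine(data_table), looks_genuine(layout_table))
```

Each table is passed in as a list of rows of cell strings, so the HTML parsing step is out of scope here. A table with a single value row is trivially coherent under this measure, which is one reason a real detector would need further checks such as the semantic coherency test the abstract mentions.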


Pages: 745-757

Number of pages: 13
