A model for detecting and merging vertically spanned table cells in plain text documents

Cited by: 2
Authors
Long, V [1]
Dale, R [1]
Cassidy, S [1]
Affiliations
[1] Macquarie Univ, Div Informat & Commun Sci, Ctr Language Technol, Sydney, NSW 2109, Australia
Source
EIGHTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS 1 AND 2, PROCEEDINGS | 2005
DOI
10.1109/ICDAR.2005.21
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
A spanned cell in a table is a single, complete unit that physically occupies multiple columns and/or multiple rows. Spanned cells are common in tables, and they are a significant cause of error in the extraction of tables from free text documents. In this paper we present a model for the detection and merging of vertically spanned cells for tables presented in plain text documents. Our model and algorithm are based purely on the layout features of the tables, and they require no semantic understanding of the documents. When tested on the 98 tables appearing in 40 randomly selected documents from a corpus of company announcements from the Australian Stock Exchange (ASX), our algorithm achieves an accuracy of 86.79% in detecting and merging vertically spanned cells.
Pages: 1242-1246
Page count: 5