Text Classification and Document Layout Analysis of Paper Fragments

Cited by: 8
Authors
Diem, Markus [1]
Kleber, Florian [1]
Sablatnig, Robert [1]
Affiliations
[1] Vienna Univ Technol, Comp Vision Lab, Vienna, Austria
Source
11TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR 2011) | 2011
Keywords
local features; text classification; layout analysis;
DOI
10.1109/ICDAR.2011.175
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In general, document image analysis methods are pre-processing steps for Optical Character Recognition (OCR) systems. In contrast, the proposed method aims at clustering document snippets, so that an automated clustering of documents can be performed. To this end, words are classified as printed text, manuscript, or noise, where the third class corrects falsely segmented background elements. Having classified the text elements, a layout analysis is carried out which groups words into text lines and paragraphs. A back propagation of the class weights, assigned to each word in the first step, enables correcting wrong class labels. The proposed method shows promising results on a dataset consisting of document snippets with varying shapes, content, writing, and layout. In addition, the system is compared to page segmentation methods of the ICDAR 2009 Page Segmentation Competition.
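The abstract does not say how the class weights are propagated back, so the following is only a minimal sketch of one plausible reading: the per-word class weights are summed over each text line, and every word in the line takes the line's strongest class. The function name `relabel_by_line`, the `line` and `weights` fields, and the summing rule are all assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

# The three word classes named in the abstract.
CLASSES = ("printed", "manuscript", "noise")


def relabel_by_line(words):
    """Assign each word the class with the largest summed weight in its text line.

    `words` is a list of dicts with a hypothetical layout:
      {"line": <line id>, "weights": {class name: weight}}
    Returns one class label per word, in input order.
    """
    # Accumulate class weights per text line (assumed aggregation: plain sum).
    line_totals = defaultdict(lambda: dict.fromkeys(CLASSES, 0.0))
    for word in words:
        for cls in CLASSES:
            line_totals[word["line"]][cls] += word["weights"].get(cls, 0.0)

    # Propagate the line-level decision back to every word in that line.
    labels = []
    for word in words:
        totals = line_totals[word["line"]]
        labels.append(max(CLASSES, key=lambda cls: totals[cls]))
    return labels


def example_words():
    """Two text lines; the third word of line 0 is locally misclassified."""
    return [
        {"line": 0, "weights": {"printed": 0.8, "manuscript": 0.1, "noise": 0.1}},
        {"line": 0, "weights": {"printed": 0.7, "manuscript": 0.2, "noise": 0.1}},
        {"line": 0, "weights": {"printed": 0.4, "manuscript": 0.5, "noise": 0.1}},
        {"line": 1, "weights": {"printed": 0.1, "manuscript": 0.2, "noise": 0.7}},
    ]
```

In this toy input the third word would be labelled manuscript on its own, but the summed weights of its line favour printed text, so the line-level vote overrides the local label. A weighted or paragraph-level vote would be an equally valid variant under the same assumption.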
Pages: 854-858
Page count: 5