CluSTi: Clustering Method for Table Structure Recognition in Scanned Images

Cited by: 6
Authors
Zucker, Arthur [1]
Belkada, Younes [1]
Hanh Vu [2]
Van Nam Nguyen [3]
Affiliations
[1] Sorbonne Univ, Polytech Sorbonne, F-75005 Paris, France
[2] Viettel CyberSpace Ctr, 41st Floor,Keangnam Landmark 72, Hanoi, Vietnam
[3] Thuyloi Univ, Comp Sci & Engn Dept, 175 TaySon, Hanoi, Vietnam
Keywords
Table structure recognition; Object recognition; Clustering method
DOI
10.1007/s11036-021-01759-9
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
OCR (Optical Character Recognition) for scanned paper invoices is very challenging due to the variability of 19 invoice layouts, different information fields, large data tables, and low scanning quality. In this setting, table structure recognition is a critical task in which all rows, columns, and cells must be accurately located and extracted. Existing methods such as DeepDeSRT have only dealt with high-quality born-digital images (e.g., PDF) with low noise and an apparent table structure. This paper proposes an efficient method called CluSTi (Clustering method for recognition of the Structure of Tables in invoice scanned Images). The contributions of CluSTi are three-fold. First, it removes heavy noise from the table images using a clustering algorithm. Second, it extracts all text boxes using state-of-the-art text recognition. Third, using horizontal and vertical clustering algorithms with optimized parameters, CluSTi groups the text boxes into their correct rows and columns, respectively. The method was evaluated on three datasets: i) 397 public scanned images; ii) 193 PDF document images from the ICDAR 2013 competition dataset; and iii) 281 PDF document images from ICDAR 2019's numeric tables. CluSTi achieved F1-scores of 87.5%, 98.5%, and 94.5% on these datasets, respectively. It also outperformed DeepDeSRT, which achieved an F1-score of 91.44% on only 34 images from the ICDAR 2013 competition dataset. To the best of our knowledge, CluSTi is the first method to tackle the table structure recognition problem on scanned images.
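The row and column grouping step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or its optimized parameters, so this example assumes a simple gap-threshold clustering on box coordinates; `Box`, `cluster_1d`, `assign_cells`, `row_gap`, and `col_gap` are hypothetical names introduced for illustration only and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned text box (hypothetical structure, pixel coordinates)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def cluster_1d(values, max_gap):
    """Group indices whose sorted values are chained by gaps <= max_gap.

    This is a simple stand-in for the paper's clustering algorithm,
    which the abstract does not specify.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = []
    for i in order:
        if groups and values[i] - values[groups[-1][-1]] <= max_gap:
            groups[-1].append(i)  # close enough to the previous value
        else:
            groups.append([i])    # start a new cluster
    return groups


def assign_cells(boxes, row_gap, col_gap):
    """Return a (row, column) index for each box.

    Rows come from clustering vertical centers (horizontal bands);
    columns come from clustering left edges (vertical bands).
    """
    y_centers = [(b.y0 + b.y1) / 2 for b in boxes]
    x_lefts = [b.x0 for b in boxes]
    row_of, col_of = {}, {}
    for r, group in enumerate(cluster_1d(y_centers, row_gap)):
        for i in group:
            row_of[i] = r
    for c, group in enumerate(cluster_1d(x_lefts, col_gap)):
        for i in group:
            col_of[i] = c
    return [(row_of[i], col_of[i]) for i in range(len(boxes))]


# Toy 2x2 table with slightly misaligned boxes, as in a noisy scan.
boxes = [
    Box(10, 10, 60, 20, "Item"), Box(200, 11, 240, 21, "Qty"),
    Box(12, 40, 80, 50, "Paper"), Box(205, 41, 220, 51, "3"),
]
cells = assign_cells(boxes, row_gap=8, col_gap=30)
print(cells)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The thresholds `row_gap` and `col_gap` play the role of the tunable parameters; in practice they would likely need to scale with text height and image resolution, and left-edge clustering would fail on right-aligned numeric columns.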
Pages: 1765-1776
Number of pages: 12