FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents

Cited by: 143
Authors
Jaume, Guillaume [1]
Ekenel, Hazim Kemal [2]
Thiran, Jean-Philippe [1]
Affiliations
[1] Swiss Fed Inst Technol, Signal Proc Lab 5, Lausanne, Switzerland
[2] Istanbul Tech Univ, Dept Comp Engn, Istanbul, Turkey
Source
2019 INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION WORKSHOPS (ICDARW) AND 2ND INTERNATIONAL WORKSHOP ON OPEN SERVICES AND TOOLS FOR DOCUMENT ANALYSIS (OST), VOL 2 | 2019
Keywords
Text detection; Optical Character Recognition; Form Understanding; Spatial Layout Analysis; ALGORITHMS;
DOI
10.1109/ICDARW.2019.10029
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present a new dataset for form understanding in noisy scanned documents (FUNSD) that aims at extracting and structuring the textual content of forms. The dataset comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking. To the best of our knowledge, this is the first publicly available dataset with comprehensive annotations to address the FoUn task.
Pages: 1-6
Page count: 6