A New Annotation Method and Dataset for Layout Analysis of Long Documents

Cited by: 2
Authors
Ahuja, Aman [1]
Dinh, Kevin [1]
Dinh, Brian [1]
Ingram, William A. [1]
Fox, Edward A. [1]
Affiliations
[1] Virginia Tech, Blacksburg, VA 24061 USA
Source
COMPANION OF THE WORLD WIDE WEB CONFERENCE, WWW 2023 | 2023
Keywords
Object Detection; Scholarly Documents; Electronic Theses and Dissertations; Document Understanding; AI-Aided
DOI
10.1145/3543873.3587609
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Parsing long documents, such as books, theses, and dissertations, is an important component of information extraction from scholarly documents. Layout analysis methods based on object detection have been developed in recent years to help with PDF document parsing. However, several challenges hinder the adoption of such methods for scholarly documents such as theses and dissertations. These include (a) the manual effort and resources required to annotate training datasets, (b) the scanned nature of many documents and the inherent noise resulting from the capture process, and (c) the imbalanced distribution of various types of elements in the documents. In this paper, we address some of the challenges related to object-detection-based layout analysis for scholarly long documents. First, we propose an AI-aided annotation method to help develop training datasets for object-detection-based layout analysis. This leverages the knowledge of existing trained models to help human annotators, thus reducing the time required for annotation. It also addresses the class imbalance problem, guiding annotators to focus on labeling instances of rare classes. We also introduce ETD-ODv2, a novel dataset for object detection on electronic theses and dissertations (ETDs). In addition to the page images included in ETD-OD [1], our dataset consists of more than 16K manually annotated page images originating from 100 scanned ETDs, along with annotations for 20K page images primarily consisting of rare classes that were labeled using the proposed framework. The new dataset thus covers a diversity of document types, viz., scanned and born-digital, and is better balanced in terms of training samples from different object categories.
Pages: 834-842
Page count: 9
Related papers
23 in total
[1]  
Ahuja Aman, 2022, P 1 WORKSH INF EXTR, P121
[2]  
Dinh Kevin, 2022, Object Detection
[3]   Fast R-CNN [J].
Girshick, Ross.
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :1440-1448
[4]   LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking [J].
Huang, Yupan;
Lv, Tengchao;
Cui, Lei;
Lu, Yutong;
Wei, Furu.
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, :4083-4091
[5]   FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents [J].
Jaume, Guillaume;
Ekenel, Hazim Kemal;
Thiran, Jean-Philippe.
2019 INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION WORKSHOPS (ICDARW) AND 2ND INTERNATIONAL WORKSHOP ON OPEN SERVICES AND TOOLS FOR DOCUMENT ANALYSIS (OST), VOL 2, 2019, :1-6
[6]   ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations [J].
Kahu, Sampanna Yashwant;
Ingram, William A.;
Fox, Edward A.;
Wu, Jian.
2021 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL 2021), 2021, :180-191
[7]  
LEBOURGEOIS F, 1992, 11TH IAPR INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, PROCEEDINGS, VOL II, P272, DOI 10.1109/ICPR.1992.201771
[8]  
Li M., 2020, P 28 INT C COMP LING, P949, DOI 10.18653/v1/2020.coling-main.82
[9]  
Li MH, 2020, PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), P1918
[10]  
Lopez P., 2008, Grobid