TABLEX: A Benchmark Dataset for Structure and Content Information Extraction from Scientific Tables

Cited by: 11
Authors
Desai, Harsh [1]
Kayal, Pratik [1]
Singh, Mayank [1]
Affiliations
[1] Indian Institute of Technology, Gandhinagar, India
Source
DOCUMENT ANALYSIS AND RECOGNITION - ICDAR 2021, PT II | 2021 / Vol. 12822
Keywords
Information Extraction; LaTeX; Scientific articles
DOI
10.1007/978-3-030-86331-9_36
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Information Extraction (IE) from tables in scientific articles is challenging due to complicated tabular representations and complex embedded text. This paper presents TABLEX, a large-scale benchmark dataset comprising table images generated from scientific articles. TABLEX consists of two subsets, one for table structure extraction and the other for table content extraction. Each table image is accompanied by its corresponding LaTeX source code. To facilitate the development of robust table IE tools, TABLEX contains images in different aspect ratios and in a variety of fonts. Our analysis sheds light on the shortcomings of current state-of-the-art table extraction models and shows that they fail even on simple table images. Finally, we experiment with an existing transformer-based baseline and report performance scores. In contrast to static benchmarks, we plan to augment this dataset with more complex and diverse tables at regular intervals.
Pages: 554-569
Page count: 16