Tables to LaTeX: structure and content extraction from scientific tables

Cited by: 5
Authors
Kayal, Pratik [1]
Anand, Mrinal [1]
Desai, Harsh [1]
Singh, Mayank [1]
Affiliations
[1] Indian Inst Technol Gandhinagar, Dept Comp Sci & Engn, Gandhinagar 382355, India
Keywords
Scientific documents; Transformer; LaTeX; Tabular information; Information extraction
DOI
10.1007/s10032-022-00420-9
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Scientific documents contain tables that list important information in a concise fashion. Extracting structure and content from tables embedded in PDF research documents is challenging because of visual features such as spanning cells and content features such as mathematical symbols and equations. Most existing table structure identification methods tend to ignore these academic writing features. In this paper, we adapt the transformer-based language modeling paradigm for scientific table structure and content extraction. Specifically, the proposed model converts a tabular image to its corresponding LaTeX source code. Overall, we outperform the current state-of-the-art baselines and achieve exact match accuracies of 70.35% and 49.69% on table structure and content extraction, respectively. Further analysis demonstrates that the proposed models efficiently identify the number of rows and columns, the alphanumeric characters, the LaTeX tokens, and symbols.
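The abstract reports results as exact match accuracy. As a rough illustration of how such a metric is commonly computed, here is a minimal sketch, not the authors' evaluation code. The function name `exact_match_accuracy` and the whitespace tokenization are assumptions made for this example; the paper's actual tokenization and normalization are not described in this record.

```python
def exact_match_accuracy(predictions, references):
    """Return the percentage of predictions that match their reference exactly.

    Assumption: sequences are compared token by token after splitting on
    whitespace, so differences in spacing alone do not count as errors.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        pred.split() == ref.split()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * matches / len(references)


if __name__ == "__main__":
    # Hypothetical predicted and ground-truth LaTeX strings for two tables.
    refs = [
        r"\begin{tabular}{cc} a & b \\ \end{tabular}",
        r"\begin{tabular}{cc} 1 & 2 \\ \end{tabular}",
    ]
    preds = [
        r"\begin{tabular}{cc} a & b \\ \end{tabular}",
        r"\begin{tabular}{cc} 1 & 3 \\ \end{tabular}",
    ]
    print(exact_match_accuracy(preds, refs))
```

Under this definition a prediction earns credit only when every token agrees with the reference, which is why exact match is a strict measure for long outputs such as full table content.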
Pages: 121-130
Page count: 10