PDFFigures 2.0: Mining Figures from Research Papers

Cited by: 76
Authors
Clark, Christopher [1]
Divvala, Santosh [1]
Affiliations
[1] Univ Washington, Allen Inst Artificial Intelligence, Seattle, WA 98195 USA
Source
2016 IEEE/ACM JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL) | 2016
Keywords
Scalable figure extraction; academic search engine; section title extraction; figure usage analysis;
DOI
10.1145/2910896.2910904
Chinese Library Classification
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Figures and tables are key sources of information in many scholarly documents. However, current academic search engines do not make use of figures and tables when semantically parsing documents or presenting document summaries to users. To facilitate these applications, we develop an algorithm, called "PDFFigures 2.0," that extracts figures, tables, and captions from documents. Our proposed approach analyzes the structure of individual pages by detecting captions, graphical elements, and chunks of body text, and then locates figures and tables by reasoning about the empty regions within that text. To evaluate our work, we introduce a new dataset of computer science papers, along with ground truth labels for the locations of the figures, tables, and captions within them. Our algorithm achieves 94% precision at 90% recall on this dataset, surpassing the previous state of the art. Further, we show how our framework was used to extract figures from a corpus of over one million papers, and how the resulting extractions were integrated into the user interface of a smart academic search engine, Semantic Scholar (www.semanticscholar.org). Finally, we present results of exploratory data analysis completed on the extracted figures, as well as an extension of our method to the task of section title extraction.
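As an illustration of how precision and recall figures like those quoted in the abstract can be computed against ground truth bounding boxes, here is a minimal Python sketch. It is not the authors' evaluation code. The box format, the greedy matching strategy, the function names `iou` and `precision_recall`, and the 0.8 overlap threshold are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predicted: List[Box], truth: List[Box], threshold: float = 0.8
) -> Tuple[float, float]:
    """Greedily match each predicted box to its best unmatched ground truth box.

    A prediction counts as correct when its IoU with that box reaches the
    threshold (0.8 here is an assumed value, not taken from the record).
    """
    unmatched = list(truth)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Precision here is the fraction of extracted boxes that match a labeled figure or table, and recall is the fraction of labeled figures and tables that were recovered.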
Pages: 143-152
Page count: 10