Figure Metadata Extraction From Digital Documents

Cited by: 24
Authors
Choudhury, Sagnik Ray [1]
Mitra, Prasenjit [1, 2]
Kirk, Andi [3]
Szep, Silvia [3]
Pellegrino, Donald [3]
Jones, Sue [3]
Giles, C. Lee [1, 2]
Affiliations
[1] Penn State Univ, Informat Sci Technol, University Pk, PA 16802 USA
[2] Penn State Univ, Comp Sci Engn, University Pk, PA 16802 USA
[3] Dow Chem Co USA, Spring House, PA 19477 USA
Source
2013 12TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR) | 2013
Funding
National Science Foundation (USA)
DOI
10.1109/ICDAR.2013.34
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Academic papers contain multiple figures (information graphics) that represent important findings and experimental results. Automatic data extraction from such figures and classification of information graphics are not straightforward, and both are well-studied problems in document analysis [6]. Moreover, very few digital library search engines index figures or their associated metadata (figure captions) from PDF documents. We describe the very first step in indexing, classification, and data extraction from figures in PDF documents: accurate automatic extraction of figures and their associated metadata, a nontrivial task. Document layout, font information, and lexical and linguistic features are considered for figure caption extraction from PDF documents, using both rule-based and machine-learning-based approaches. We also describe a digital library search engine that indexes figure captions and mentions from 150K documents, extracted by our custom-built extractor.
Pages: 135-139
Number of pages: 5
Related papers
9 in total
[1] [Anonymous], 2004, P 2004 C EMPIRICAL M
[2] Summarizing Figures, Tables, and Algorithms in Scientific Publications to Augment Search Results [J]. Bhatia, Sumit; Mitra, Prasenjit. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2012, 30 (01)
[3] Choudhury S. Ray, 2013, PROCEEDINGS OF THE 1
[4] Liu RZ, 2007, PROC INT CONF DOC, P521
[5] Liu Y., 2009, DOCUMENT ANALYSIS AN, P1006
[6] Lopez L., 2011, BIOINFORMATICS AND B, P578
[7] Automated analysis of images in documents for intelligent document search [J]. Lu, Xiaonan; Kataria, Saurabh; Brouwer, William J.; Wang, James Z.; Mitra, Prasenjit; Giles, C. Lee. INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2009, 12 (02): 65-81
[8] Prasad V., 2007, CONTENT BASED MULTIM, P85
[9] Savva Manolis, 2011, P 24 ANN ACM S USER, P393, DOI 10.1145/2047196.2047247