Automated analysis of images in documents for intelligent document search

Cited by: 38
Authors
Lu, Xiaonan [1]
Kataria, Saurabh [2]
Brouwer, William J. [3]
Wang, James Z. [1, 2]
Mitra, Prasenjit [1, 2]
Giles, C. Lee [1, 2]
Affiliations
[1] Penn State Univ, Dept Comp Sci & Engn, University Pk, PA 16802 USA
[2] Penn State Univ, Coll Informat Sci & Technol, University Pk, PA 16802 USA
[3] Penn State Univ, Dept Chem, University Pk, PA 16802 USA
Funding
U.S. National Science Foundation
Keywords
Image; Document search; Figure; 2-D plot; Data extraction; Text block extraction; CLASSIFICATION; EXTRACTION; CHARACTERS;
DOI
10.1007/s10032-009-0081-0
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure's legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.
Pages: 65-81
Page count: 17