Document Similarity Measures and Document Browsing

Cited by: 0
Authors
Ahmadullin, Ildus [1 ]
Fan, Jian [2 ]
Damera-Venkata, Niranjan [2 ]
Lim, Suk Hwan [2 ]
Lin, Qian [2 ]
Liu, Jerry [2 ]
Liu, Sam [2 ]
O'Brien-Strain, Eamonn [2 ]
Allebach, Jan [1 ]
Affiliations
[1] Purdue Univ, W Lafayette, IN 47907 USA
[2] Hewlett Packard Labs, Palo Alto, CA 94304 USA
Source
IMAGING AND PRINTING IN A WEB 2.0 WORLD II | 2011 / Vol. 7879
Keywords
DOI
10.1117/12.877268
Chinese Library Classification
O43 [Optics];
Discipline classification code
070207 ; 0803 ;
Abstract
Managing large document databases is an important task today. The ability to automatically compare document layouts, and to classify and search documents by their visual appearance, is desirable in many applications. We measure the similarity of single-page documents using distance functions between three document components: background, text, and saliency. Each document component is represented as a Gaussian mixture distribution, and distances between corresponding components of different documents are calculated as probabilistic similarities between the corresponding distributions. The similarity measure between documents is a weighted sum of the components' distances. Using this document similarity measure, we propose a browsing mechanism that operates on a document dataset. For this purpose, we use a hierarchical browsing environment that we call the document similarity pyramid. It allows the user to browse a large document dataset and to search it for documents that are similar to a query. The user can browse the dataset at different levels of the pyramid and zoom in on documents of interest.
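The weighted-sum measure described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption and not taken from the paper: each component is modeled as a mixture of 2-D Gaussians with diagonal covariance over normalized page coordinates, the distance between mixtures is a symmetrized matching-based approximation of the Kullback-Leibler divergence (the abstract does not say which probabilistic similarity is used), and the names `kl_diag_gauss`, `gmm_divergence`, `component_distance`, `document_distance`, and the values in `WEIGHTS` are hypothetical.

```python
import math

def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(p || q) for two Gaussians with diagonal covariance."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )

def gmm_divergence(f, g):
    """Matching-based approximation of KL(f || g) between two mixtures.

    Each mixture is a list of (weight, mean, variance) tuples. Every
    component of f is matched to the cheapest component of g.
    """
    return sum(
        wf * min(
            kl_diag_gauss(mf, vf, mg, vg) + math.log(wf / wg)
            for wg, mg, vg in g
        )
        for wf, mf, vf in f
    )

def component_distance(f, g):
    """Symmetrized divergence, so the order of the two documents is irrelevant."""
    return 0.5 * (gmm_divergence(f, g) + gmm_divergence(g, f))

def document_distance(doc_a, doc_b, weights):
    """Weighted sum of the per-component distances."""
    return sum(
        w * component_distance(doc_a[name], doc_b[name])
        for name, w in weights.items()
    )

# Hypothetical component weights; the abstract does not give values.
WEIGHTS = {"background": 0.2, "text": 0.5, "saliency": 0.3}

# Toy documents: doc_a has two side-by-side text blocks, doc_b two stacked ones.
doc_a = {
    "background": [(1.0, (0.5, 0.5), (0.1, 0.1))],
    "text": [(0.5, (0.25, 0.3), (0.01, 0.02)), (0.5, (0.75, 0.3), (0.01, 0.02))],
    "saliency": [(1.0, (0.5, 0.8), (0.02, 0.02))],
}
doc_b = {
    "background": [(1.0, (0.5, 0.5), (0.1, 0.1))],
    "text": [(0.5, (0.5, 0.3), (0.04, 0.02)), (0.5, (0.5, 0.7), (0.04, 0.02))],
    "saliency": [(1.0, (0.5, 0.2), (0.02, 0.02))],
}

print(document_distance(doc_a, doc_a, WEIGHTS))
print(document_distance(doc_a, doc_b, WEIGHTS))
```

In this sketch a document compared with itself scores zero and documents with different layouts score higher, which is the property a browsing structure such as the similarity pyramid would rely on when grouping similar pages.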
Pages: 8
Related papers
10 in total
[1]  
Achanta R, 2009, PROC CVPR IEEE, P1597, DOI 10.1109/CVPRW.2009.5206596
[2]  
[Anonymous], 1968, INFORM THEORY STAT
[3]   Similarity pyramids for browsing and organization of large image databases [J].
Chen, JY ;
Bouman, CA ;
Dalton, JC .
HUMAN VISION AND ELECTRONIC IMAGING III, 1998, 3299 :563-575
[4]   Multiscale Bayesian segmentation using a trainable context model [J].
Cheng, H ;
Bouman, CA .
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2001, 10 (04) :511-525
[5]  
GOLDBERGER J, 2003, INT C COMP VIS
[6]  
Gori M., 2003, IEEE T PATTERN RECOG, V25
[7]   A continuous probabilistic framework for image matching [J].
Greenspan, H ;
Goldberger, J ;
Ridel, L .
COMPUTER VISION AND IMAGE UNDERSTANDING, 2001, 84 (03) :384-406
[8]  
Harper M.P., 2004, P IS T ARCH C
[9]   Comparison and Classification of Documents Based on Layout Similarity [J].
Jianying Hu ;
Ramanujan Kashi ;
Gordon Wilfong .
Information Retrieval, 2000, 2 (2-3) :227-243
[10]   A UNIVERSAL PRIOR FOR INTEGERS AND ESTIMATION BY MINIMUM DESCRIPTION LENGTH [J].
RISSANEN, J .
ANNALS OF STATISTICS, 1983, 11 (02) :416-431