Duplicate document detection by template matching

Cited by: 12
Authors
Caprari, RS [1]
Affiliations
[1] Def Sci & Technol Org, Salisbury, SA 5108, Australia
Keywords
template matching; correlation; literal similarity; binary pattern recognition; duplicate document detection; facsimile document analysis
DOI
10.1016/S0262-8856(99)00086-4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We discuss some operational issues pertaining to the detection of duplicates in the databases of bitmapped binary document images, and reason that efficient and effective duplicate document detection probably needs a combination of an efficient primary detector and an effective subordinate detector to be achieved. An algorithm that executes binary pattern template matching by cross-correlation is proposed as a duplicate document detection methodology. The template matching operation is amenable to pixel-parallel computation on serial architecture computers by bitwise integer operations. A description of the algorithm is accompanied by a discussion of issues that arise in its practical implementation. Duplicate detection by template matching is especially well suited to facsimile (i.e. fax) databases, in particular for detecting the single feed-multiple transmissions that often dominate the occurrence of duplicates in fax databases. Detailed experimental results presented for fax documents demonstrate that template matching is suitable as both a primary detector when conducted with small template and search area sizes, and a subordinate detector when conducted with moderate template and search area sizes. (C) 2000 Elsevier Science B.V. All rights reserved.
Pages: 633-643
Number of pages: 11