Duplicate document detection in DocBrowse

Cited by: 1

Authors:

Chalana, V [1]

Bruce, A [1]

Nguyen, T [1]

Affiliation:

[1] MathSoft, Data Analysis Products Division, Seattle, WA 98109 USA

Source:

DOCUMENT RECOGNITION V | 1998 / Vol. 3305

Keywords:

duplicate document detection; document imaging; document image database; optical character recognition; OCR; information retrieval; wavelet transform;

DOI:

10.1117/12.304630

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Duplicate documents are frequently found in large databases of digital documents, such as those in digital libraries or in the government declassification effort. Efficient duplicate document detection is important not only to allow querying for similar documents, but also to filter out redundant information in large document databases. We have designed three different algorithms to identify duplicate documents. The first algorithm is based on features extracted from the textual content of a document, the second is based on wavelet features extracted from the document image itself, and the third is a combination of the first two. These algorithms are integrated within the DocBrowse system for information retrieval from document images, which is currently under development at MathSoft. DocBrowse supports duplicate document detection by allowing (1) automatic filtering to hide duplicate documents, and (2) ad hoc querying for similar or duplicate documents. We have tested the duplicate document detection algorithms on 171 documents and found that the text-based method has an average 11-point precision of 97.7%, while the image-based method has an average 11-point precision of 98.9%. In general, however, the text-based method performs better when the document contains enough high-quality machine-printed text, while the image-based method performs better when the document contains little or no high-quality machine-readable text.
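
The precision figures quoted in the abstract appear to be 11-point average precision, a standard information retrieval measure: interpolated precision is taken at the recall levels 0.0, 0.1, ..., 1.0 and averaged. The abstract does not say how the authors computed it, so the following is only a minimal sketch of the textbook definition. The function name and the input format (a ranked list of relevance flags for one query, plus the number of true duplicates) are assumptions made for illustration and are not taken from the paper or from DocBrowse.

```python
from typing import Sequence


def eleven_point_average_precision(
    ranked_relevance: Sequence[bool], total_relevant: int
) -> float:
    """Mean interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    ranked_relevance: hypothetical input, one flag per retrieved document
        in rank order, True if that document is a true duplicate.
    total_relevant: number of true duplicates that exist for the query.
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")

    # Record (recall, precision) after each rank position.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: the best precision observed at any
        # recall greater than or equal to this level (0 if none reached).
        candidates = [p for r, p in points if r >= level - 1e-12]
        total += max(candidates, default=0.0)
    return total / 11
```

For a single query, a ranking that returns both of two true duplicates first, such as `eleven_point_average_precision([True, True, False], 2)`, scores 1.0. A figure like the ones in the abstract would presumably be the mean of such scores over all test queries, although the abstract does not state the averaging procedure.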


Pages: 169-178

Number of pages: 10