机构:
Digital Equipment Corp, Syst Res Ctr, Palo Alto, CA 94301 USADigital Equipment Corp, Syst Res Ctr, Palo Alto, CA 94301 USA
Broder, AZ
[1
]
机构:
[1] Digital Equipment Corp, Syst Res Ctr, Palo Alto, CA 94301 USA
来源:
COMPRESSION AND COMPLEXITY OF SEQUENCES 1997 - PROCEEDINGS
|
1998年
关键词:
D O I:
10.1109/SEQUEN.1997.666900
中图分类号:
TP [自动化技术、计算机技术];
学科分类号:
0812 ;
摘要:
Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin fingerprints.