All-Three: Near-optimal and domain-independent algorithms for near-duplicate detection

Cited by: 2
Authors
Fellah, Aziz [1]
Affiliation
[1] Northwest Missouri State Univ, Sch Comp Sci & Informat Syst, Maryville, MO 64468 USA
Keywords
Near-duplicate detection; Near-duplicates; Approximate duplicates; Clustering; Data mining applications and discovery; Data cleaning; Record linkage; Methodology
DOI
10.1016/j.array.2021.100070
Chinese Library Classification number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
In this paper, we propose a general domain-independent approach called Merge-Filter Representative-based Clustering (Merge-Filter-RC) for detecting near-duplicate records within a single data source and across multiple data sources. Subsequently, we develop three near-optimal classes of algorithms: constant threshold (CT), variable threshold (VT), and function threshold (FT), which we collectively call the All-Three algorithms. Merge-Filter-RC and All-Three form the basis of this work. Merge-Filter-RC works recursively, in a divide-and-merge fashion, to distill local and global near-duplicates into hierarchical clusters along with their prototype representatives. Each cluster is characterized by one or more representatives, which are in turn refined dynamically. Further similarity comparisons are made against the representatives only, which reduces the number of pairwise comparisons and consequently the search space. In addition, we segregate the results of the comparisons by labels, which we refer to as very similar, similar, or not similar. We complement the All-Three algorithms with a more thorough reexamination of the original well-tuned features of the seminal Monge-Elkan (ME) algorithm, which we circumvented by using an affine variant of the Smith-Waterman (SW) similarity measure. Using both real-world benchmarks and synthetically generated data sets, we performed several experiments and extensive analysis to show that the All-Three algorithms, which are rooted in the Merge-Filter-RC approach, significantly outperform the Monge-Elkan algorithm in terms of accuracy in detecting near-duplicates. In addition, the All-Three algorithms are as computationally efficient as the Monge-Elkan algorithm.
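The ideas in the abstract (an affine-gap Smith-Waterman similarity, three comparison labels, and clusters compared only through their representatives) can be illustrated with a minimal Python sketch. It is not the paper's Merge-Filter-RC or any of the All-Three algorithms: the scoring parameters, the normalization by the longer string, the fixed thresholds 0.9 and 0.7, the rule that promotes "similar" records to representatives, and all function names are assumptions made here for illustration only.

```python
from dataclasses import dataclass

NEG_INF = float("-inf")
MATCH = 2  # assumed reward per matching character


def sw_affine_score(a, b, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Best local alignment score with affine gaps (Gotoh-style recurrences)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]        # best score ending at (i, j)
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignment ends in a gap along b
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignment ends in a gap along a
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            diag = H[i - 1][j - 1] + (MATCH if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, E[i][j], F[i][j])  # 0 restarts the local alignment
            best = max(best, H[i][j])
    return best


def similarity(a, b):
    """Score scaled to [0, 1] by the longer string (an assumed normalization)."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return sw_affine_score(a, b) / (MATCH * max(len(a), len(b)))


def label(score, very=0.9, sim=0.7):
    """Map a score to the three labels; both thresholds are assumed constants."""
    if score >= very:
        return "very similar"
    return "similar" if score >= sim else "not similar"


@dataclass
class Cluster:
    representatives: list
    members: list


def cluster_records(records, very=0.9, sim=0.7):
    """Single-pass clustering that compares each record to representatives only."""
    clusters = []
    for rec in records:
        best_score, best_cluster = 0.0, None
        for c in clusters:
            for rep in c.representatives:
                s = similarity(rec, rep)
                if s > best_score:
                    best_score, best_cluster = s, c
        tag = label(best_score, very, sim)
        if best_cluster is None or tag == "not similar":
            clusters.append(Cluster([rec], [rec]))  # record starts a new cluster
        else:
            best_cluster.members.append(rec)
            if tag == "similar":
                # Assumed refinement rule: a borderline member becomes an
                # extra representative so later variants can still match.
                best_cluster.representatives.append(rec)
    return clusters


if __name__ == "__main__":
    for c in cluster_records(["John Smith", "Jon Smith", "Mary Jones", "John Smith"]):
        print(c.representatives, c.members)
```

Because each incoming record is scored against representatives rather than against every stored record, the number of comparisons grows with the number of representatives, which is the search-space reduction the abstract describes. A variable or function threshold could be sketched by replacing the two constants with values computed per cluster or per record, but how the paper defines those is not stated in this record.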
Pages: 15