All-Three: Near-optimal and domain-independent algorithms for near-duplicate detection

被引:2
作者
Fellah, Aziz [1 ]
机构
[1] Northwest Missouri State Univ, Sch Comp Sci & Informat Syst, Maryville, MO 64468 USA
关键词
Near-duplicate detection; Near-duplicates; Approximate duplicates; Clustering; Data mining applications and discovery; Data cleaning; RECORD-LINKAGE; METHODOLOGY;
D O I
10.1016/j.array.2021.100070
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
In this paper, we propose a general domain-independent approach called Merge-Filter Representative-based Clustering (Merge-Filter-RC) for detecting near-duplicate records within a single and across multiple data sources. Subsequently, we develop three near-optimal classes of algorithms: constant threshold (CT) variable threshold (VT) and function threshold (FT), which we collectively call All-Three algorithms. Merge-Filter-RC and All-Three mold the basis of this work. Merge-Filter-RC works recursively in the spirit of divide-merge fashion for distilling locally and globally near-duplicates as hierarchical clusters along with their prototype representatives. Each cluster is characterized by one or more representatives which are in turn refined dynamically. Representatives are used for further similarity comparisons to reduce the number of pairwise comparisons and consequently the search space. In addition, we segregate the results of the comparisons by labels which we refer to as very similar, similar, or not similar. We complement All-Three algorithms by a more thorough reexamination of the original well-tuned features of the seminal work of Monge-Elkan's (ME) algorithm which we circumvented by an affine variant of the Smith-Waterman's (SW) similarity measure. Using both real-world benchmarks and synthetically generated data sets, we performed several experiments and extensive analysis to show that All-Three algorithms which are rooted in the Merge-Filter-RC approach significantly outperform Monge-Elkan's algorithm in terms of accuracy in detecting near-duplicates. In addition, All-Three algorithms are as efficient in terms of computations as Monge-Elkan's algorithm.
引用
收藏
页数:15
相关论文
共 45 条
  • [41] Explaining BERT model decisions for near-duplicate news article detection based on named entity recognition
    Novo, Anne Stockem
    Gedikli, Fatih
    2023 IEEE 17TH INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING, ICSC, 2023, : 278 - 281
  • [42] TxtAlign: Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches for Plagiarism Detection
    Wang, Zhizhi
    Zuo, Chaoji
    Deng, Dong
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (SIGMOD '22), 2022, : 1146 - 1159
  • [43] LEVERAGING AN IMAGE FOLKSONOMY AND THE SIGNATURE QUADRATIC FORM DISTANCE FOR SEMANTIC-BASED DETECTION OF NEAR-DUPLICATE VIDEO CLIPS
    Min, Hyun-seok
    Choi, Jae Young
    Neve, Wesley De
    Ro, Yong Man
    2011 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2011,
  • [44] Efficient Self-similarity Range Wide-joins Fostering Near-duplicate Image Detection in Emergency Scenarios
    Carvalho, Luiz Olmes
    Santos, Lucio F. D.
    Oliveira, Willian D.
    Traina, Agma J. M.
    Traina, Caetano, Jr.
    PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS, VOL 1 (ICEIS), 2016, : 81 - 91
  • [45] Bimodal fusion of low-level visual features and high-level semantic features for near-duplicate video clip detection
    Min, Hyun-seok
    Choi, Jae Young
    De Neve, Wesley
    Ro, Yong Man
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2011, 26 (10) : 612 - 627