All-Three: Near-optimal and domain-independent algorithms for near-duplicate detection

Cited by: 2
Authors
Fellah, Aziz [1]
Affiliations
[1] Northwest Missouri State Univ, Sch Comp Sci & Informat Syst, Maryville, MO 64468 USA
Keywords
Near-duplicate detection; Near-duplicates; Approximate duplicates; Clustering; Data mining applications and discovery; Data cleaning; Record linkage; Methodology
DOI
10.1016/j.array.2021.100070
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In this paper, we propose a general domain-independent approach called Merge-Filter Representative-based Clustering (Merge-Filter-RC) for detecting near-duplicate records within a single data source and across multiple data sources. Subsequently, we develop three near-optimal classes of algorithms: constant threshold (CT), variable threshold (VT), and function threshold (FT), which we collectively call the All-Three algorithms. Merge-Filter-RC and All-Three form the basis of this work. Merge-Filter-RC works recursively in a divide-and-merge fashion, distilling local and global near-duplicates into hierarchical clusters along with their prototype representatives. Each cluster is characterized by one or more representatives, which are in turn refined dynamically. Representatives are used for further similarity comparisons to reduce the number of pairwise comparisons and, consequently, the search space. In addition, we segregate the results of the comparisons by labels, which we refer to as very similar, similar, or not similar. We complement the All-Three algorithms with a more thorough reexamination of the original well-tuned features of the seminal Monge-Elkan (ME) algorithm, which we circumvented by using an affine variant of the Smith-Waterman (SW) similarity measure. Using both real-world benchmarks and synthetically generated data sets, we performed several experiments and extensive analysis to show that the All-Three algorithms, which are rooted in the Merge-Filter-RC approach, significantly outperform the Monge-Elkan algorithm in terms of accuracy in detecting near-duplicates. In addition, the All-Three algorithms are as computationally efficient as the Monge-Elkan algorithm.
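The representative-based idea in the abstract can be illustrated with a minimal sketch. This is not the paper's Merge-Filter-RC or any of the All-Three algorithms: it is a greedy single-pass simplification with a constant threshold, and it assumes Python's `difflib.SequenceMatcher` as a stand-in for the affine Smith-Waterman measure. The names `VERY_SIMILAR`, `SIMILAR`, `similarity`, `label`, and `cluster`, as well as the numeric cut-offs, are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical cut-offs; the abstract does not state numeric thresholds.
VERY_SIMILAR = 0.9
SIMILAR = 0.7


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: not the paper's SW variant)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def label(score: float) -> str:
    """Map a score to one of the three labels named in the abstract."""
    if score >= VERY_SIMILAR:
        return "very similar"
    if score >= SIMILAR:
        return "similar"
    return "not similar"


def cluster(records: list[str]) -> list[dict]:
    """Greedy pass: each record is compared only with cluster representatives."""
    clusters: list[dict] = []
    for rec in records:
        best, best_score = None, 0.0
        for c in clusters:
            # Score against representatives instead of every cluster member.
            score = max(similarity(rec, rep) for rep in c["representatives"])
            if score > best_score:
                best, best_score = c, score
        tag = label(best_score)
        if best is None or tag == "not similar":
            # No close representative: the record starts its own cluster.
            clusters.append({"representatives": [rec], "members": [rec]})
            continue
        best["members"].append(rec)
        if tag == "similar":
            # Close but not near-identical: keep it as an extra representative.
            best["representatives"].append(rec)
    return clusters


if __name__ == "__main__":
    demo = [
        "John Smith, 12 Main Street",
        "John Smith, 12 Main St.",
        "Jon Smith, 12 Main Street",
        "Alice Wong, 9 Oak Avenue",
    ]
    for c in cluster(demo):
        print(len(c["representatives"]), "representative(s):", c["members"])
```

Comparing against representatives rather than all members is what keeps the number of pairwise comparisons down; in this sketch the set of representatives grows only when a record is labeled similar rather than very similar, which is one simple way to mimic representatives being refined over time.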
Pages: 15
Related papers
45 in total
  • [1] Sectional MinHash for near-duplicate detection
    Hassanian-esfahani, Roya
    Kargar, Mohammad-javad
    EXPERT SYSTEMS WITH APPLICATIONS, 2018, 99 : 203 - 212
  • [2] Apollo: Near-Duplicate Detection for Job Ads in the Online Recruitment Domain
    Burk, Hunter
    Javed, Faizan
    Balaji, Janani
    2017 17TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2017), 2017 : 177 - 182
  • [3] Online Near-Duplicate Detection of News Articles
    Rodier, Simon
    Carter, Dave
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020 : 1242 - 1249
  • [4] Video Query Reformulation for Near-Duplicate Detection
    Chiu, Chih-Yi
    Li, Sheng-Yang
    Hsieh, Cheng-Yu
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2012, 7 (05) : 1594 - 1603
  • [5] Analysis of Neural Codes for Near-Duplicate Detection
    Pintus, Maurizio
    ADVANCED CONCEPTS FOR INTELLIGENT VISION SYSTEMS, ACIVS 2018, 2018, 11182 : 357 - 368
  • [6] Efficient Similarity Joins for Near-Duplicate Detection
    Xiao, Chuan
    Wang, Wei
    Lin, Xuemin
    Yu, Jeffrey Xu
    Wang, Guoren
    ACM TRANSACTIONS ON DATABASE SYSTEMS, 2011, 36 (03)
  • [7] Benchmarking unsupervised near-duplicate image detection
    Morra, Lia
    Lamberti, Fabrizio
    EXPERT SYSTEMS WITH APPLICATIONS, 2019, 135 : 313 - 326
  • [8] Codebook-Based Near-Duplicate Video Detection
    Hernandez, Guillermo
    Gonzalez Arrieta, Angelica
    Novais, Paulo
    Rodriguez, Sara
    16TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING MODELS IN INDUSTRIAL AND ENVIRONMENTAL APPLICATIONS (SOCO 2021), 2022, 1401 : 283 - 293
  • [9] Combination of Local and Global Features for Near-Duplicate Detection
    Wang, Yue
    Hou, ZuJun
    Leman, Karianto
    Nam Trung Pham
    Chua, TeckWee
    Chang, Richard
    ADVANCES IN MULTIMEDIA MODELING, PT I, 2011, 6523 : 328 - 338
  • [10] Adaptive Near-Duplicate Detection via Similarity Learning
    Hajishirzi, Hannaneh
    Yih, Wen-tau
    Kolcz, Aleksander
    SIGIR 2010: PROCEEDINGS OF THE 33RD ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH DEVELOPMENT IN INFORMATION RETRIEVAL, 2010 : 419 - 426