All-Three: Near-optimal and domain-independent algorithms for near-duplicate detection

Cited by: 2
Authors
Fellah, Aziz [1]
Affiliations
[1] Northwest Missouri State Univ, Sch Comp Sci & Informat Syst, Maryville, MO 64468 USA
Keywords
Near-duplicate detection; Near-duplicates; Approximate duplicates; Clustering; Data mining applications and discovery; Data cleaning; RECORD-LINKAGE; METHODOLOGY
DOI
10.1016/j.array.2021.100070
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In this paper, we propose a general domain-independent approach called Merge-Filter Representative-based Clustering (Merge-Filter-RC) for detecting near-duplicate records within a single data source and across multiple data sources. We then develop three near-optimal classes of algorithms: constant threshold (CT), variable threshold (VT), and function threshold (FT), which we collectively call the All-Three algorithms. Merge-Filter-RC and All-Three form the basis of this work. Merge-Filter-RC works recursively, in a divide-and-merge fashion, to distill local and global near-duplicates into hierarchical clusters along with their prototype representatives. Each cluster is characterized by one or more representatives, which are in turn refined dynamically. New records are compared against these representatives rather than against every cluster member, which reduces the number of pairwise comparisons and, consequently, the search space. In addition, we segregate the results of the comparisons by labels: very similar, similar, or not similar. We complement the All-Three algorithms with a more thorough reexamination of the original well-tuned features of the seminal Monge-Elkan (ME) algorithm, which we circumvent with an affine variant of the Smith-Waterman (SW) similarity measure. Using both real-world benchmarks and synthetically generated data sets, we performed several experiments and extensive analysis to show that the All-Three algorithms, which are rooted in the Merge-Filter-RC approach, significantly outperform the Monge-Elkan algorithm in accuracy when detecting near-duplicates. In addition, the All-Three algorithms are as computationally efficient as the Monge-Elkan algorithm.
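The representative-based comparison and three-way labeling described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's Merge-Filter-RC or any of the All-Three algorithms: the function names, the fixed thresholds (0.9 and 0.7), and the choice of the first record as a cluster's representative are all assumptions made for illustration, and `difflib.SequenceMatcher` stands in for the affine Smith-Waterman similarity used in the paper.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: the paper uses an
    affine Smith-Waterman variant, not difflib)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def label(score: float, very_similar: float = 0.9, similar: float = 0.7) -> str:
    """Map a score to one of the three labels named in the abstract.
    The threshold values are hypothetical."""
    if score >= very_similar:
        return "very similar"
    if score >= similar:
        return "similar"
    return "not similar"


def cluster_by_representative(records, very_similar=0.9, similar=0.7):
    """Assign each record to the cluster whose representative it matches best.

    Each record is compared only against one representative per cluster
    (here simply the first record placed in it), not against every member.
    """
    clusters = []  # each cluster: {"rep": str, "members": [str, ...]}
    for record in records:
        best, best_score = None, 0.0
        for cluster in clusters:
            score = similarity(record, cluster["rep"])
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and label(best_score, very_similar, similar) != "not similar":
            best["members"].append(record)
        else:
            # No representative is close enough: start a new cluster.
            clusters.append({"rep": record, "members": [record]})
    return clusters


if __name__ == "__main__":
    sample = [
        "John Smith, 12 Main St",
        "Jon Smith, 12 Main St",
        "Alice Wong, 9 Elm Rd",
    ]
    for c in cluster_by_representative(sample):
        print(c["rep"], "->", c["members"])
```

With k clusters, each incoming record costs k similarity computations instead of one per previously seen record, which is the saving the abstract attributes to representatives. A fuller treatment would also refine the representatives as clusters grow, which this sketch omits.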
Pages: 15
Related papers
45 entries in total
  • [31] Pivot-Based Similarity Wide-Joins Fostering Near-Duplicate Detection
    Carvalho, Luiz Olmes
    Dutra Santos, Lucio Fernandes
    Machado Traina, Agma Juci
    Traina, Caetano, Jr.
    ENTERPRISE INFORMATION SYSTEMS, ICEIS 2016, 2017, 291 : 81 - 104
  • [32] Near-duplicate detection using a new framework of constructing accurate affine invariant regions
    Tian, Li
    Kamata, Sei-Ichiro
    ADVANCES IN VISUAL INFORMATION SYSTEMS, 2007, 4781 : 61 - 72
  • [33] Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting
    Kumar, J. Prasanna
    Govindarajulu, P.
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE SYSTEMS, 2013, 6 (01) : 1 - 13
  • [34] XNDDF: Towards a Framework for Flexible Near-Duplicate Document Detection Using Supervised and Unsupervised Learning
    Pamulaparty, Lavanya
    Rao, C. V. Guru
    Rao, M. Sreenivasa
    INTERNATIONAL CONFERENCE ON COMPUTER, COMMUNICATION AND CONVERGENCE (ICCC 2015), 2015, 48 : 228 - 235
  • [35] Scale-Rotation Invariant Pattern Entropy for Keypoint-Based Near-Duplicate Detection
    Zhao, Wan-Lei
    Ngo, Chong-Wah
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2009, 18 (02) : 412 - 423
  • [38] Secure real-time image protection scheme with near-duplicate detection in cloud computing
    Liu, Dengzhi
    Shen, Jian
    Wang, Anxi
    Wang, Chen
    JOURNAL OF REAL-TIME IMAGE PROCESSING, 2020, 17 (01) : 175 - 184
  • [39] News Topic Tracking and Re-ranking with Query Expansion Based on Near-Duplicate Detection
    Wu, Xiaomeng
    Ide, Ichiro
    Satoh, Shin'ichi
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2009, 2009, 5879 : 755 - +
  • [40] KEYPOINT-BASED NEAR-DUPLICATE IMAGES DETECTION USING AFFINE INVARIANT FEATURE AND COLOR MATCHING
    Wang, Yue
    Hou, ZuJun
    Leman, Karianto
    2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2011 : 1209 - 1212