All-Three: Near-optimal and domain-independent algorithms for near-duplicate detection

Times cited: 2
Authors
Fellah, Aziz [1]
Affiliations
[1] Northwest Missouri State Univ, Sch Comp Sci & Informat Syst, Maryville, MO 64468 USA
Keywords
Near-duplicate detection; Near-duplicates; Approximate duplicates; Clustering; Data mining applications and discovery; Data cleaning; Record linkage; Methodology
DOI
10.1016/j.array.2021.100070
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In this paper, we propose a general domain-independent approach called Merge-Filter Representative-based Clustering (Merge-Filter-RC) for detecting near-duplicate records within a single data source and across multiple data sources. We then develop three near-optimal classes of algorithms, constant threshold (CT), variable threshold (VT), and function threshold (FT), which we collectively call the All-Three algorithms. Merge-Filter-RC and All-Three form the basis of this work. Merge-Filter-RC works recursively in a divide-and-merge fashion, distilling local and global near-duplicates into hierarchical clusters along with their prototype representatives. Each cluster is characterized by one or more representatives, which are in turn refined dynamically. Representatives are used for further similarity comparisons to reduce the number of pairwise comparisons and, consequently, the search space. In addition, we segregate the results of the comparisons by labels: very similar, similar, or not similar. We complement the All-Three algorithms with a more thorough reexamination of the original well-tuned features of the seminal Monge-Elkan (ME) algorithm, which we circumvented by an affine variant of the Smith-Waterman (SW) similarity measure. Using both real-world benchmarks and synthetically generated data sets, we performed several experiments and extensive analysis to show that the All-Three algorithms, which are rooted in the Merge-Filter-RC approach, significantly outperform the Monge-Elkan algorithm in terms of accuracy in detecting near-duplicates. In addition, the All-Three algorithms are as computationally efficient as the Monge-Elkan algorithm.
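The representative-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the similarity function is a stand-in built on the standard library's `difflib.SequenceMatcher` rather than the affine Smith-Waterman variant, the two threshold values are hypothetical, and each cluster keeps its first record as a fixed representative instead of refining it dynamically. The names `similarity`, `label`, and `cluster` are invented for illustration. The sketch only shows how comparing a record against cluster representatives, rather than against every other record, reduces the number of pairwise comparisons under a constant threshold.

```python
from difflib import SequenceMatcher

# Hypothetical constant thresholds; the abstract does not state the paper's values.
VERY_SIMILAR = 0.9
SIMILAR = 0.7


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the paper's affine Smith-Waterman variant."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def label(score: float) -> str:
    """Map a similarity score to one of the three labels named in the abstract."""
    if score >= VERY_SIMILAR:
        return "very similar"
    if score >= SIMILAR:
        return "similar"
    return "not similar"


def cluster(records):
    """Assign each record to the cluster with the closest representative.

    Assumption: the first record of a cluster serves as its only, fixed
    representative. A record labeled "not similar" to every representative
    starts a new cluster.
    """
    clusters = []  # each entry: {"rep": str, "members": [str, ...]}
    for rec in records:
        best, best_score = None, 0.0
        for c in clusters:
            score = similarity(rec, c["rep"])  # one comparison per cluster
            if score > best_score:
                best, best_score = c, score
        if best is not None and label(best_score) != "not similar":
            best["members"].append(rec)
        else:
            clusters.append({"rep": rec, "members": [rec]})
    return clusters


if __name__ == "__main__":
    for c in cluster(["John A. Smith", "John A Smith", "Jane Doe"]):
        print(c["rep"], "->", c["members"])
```

With n records and k clusters, this performs at most n·k comparisons instead of the n(n-1)/2 needed for an all-pairs scan, which is the saving the abstract attributes to representatives.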
Pages: 15