Near-duplicate document detection with improved similarity measurement

Cited by: 2
Authors
Yuan Xin-pan [1]
Long Jun [1]
Zhang Zu-ping [1]
Gui Wei-hua [1]
Affiliation
[1] Central South University, School of Information Science and Engineering, Changsha 410083, Hunan, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
similarity estimation; near-duplicate document detection; fingerprint group; Hamming distance; minwise hashing
DOI
10.1007/s11771-012-1267-z
Chinese Library Classification number
TF [Metallurgical industry]
Discipline classification code
0806
Abstract
To quickly find highly similar documents in an existing document collection, a fingerprint group merging retrieval algorithm is proposed. It addresses two limitations of similarity retrieval: the similarity threshold cannot be set too low, and using fewer fingerprints lowers accuracy. It can be proved that the fingerprint group merging retrieval algorithm improves the efficiency of similarity retrieval at lower similarity thresholds. In experiments with a lower similarity threshold r = 0.7 and a high number of fingerprint bits k = 400, the CPU time decreases from 1921 s to 273 s. Theoretical analysis and experimental results verify the effectiveness of the method.
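As background for the keywords "minwise hashing" and "similarity estimation", the following is a minimal Python sketch of plain minwise hashing, in which the fraction of matching fingerprints estimates the resemblance (Jaccard similarity) of two documents. It is not the fingerprint group merging retrieval algorithm of the paper, whose details the abstract does not give. The function names, the word-trigram shingling, and the use of salted BLAKE2b as a family of hash functions are all assumptions made for illustration; the default k = 400 merely echoes the figure in the abstract and may not denote the same quantity.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text (illustrative choice)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, item):
    """64-bit hash of item under the seed-th hash function (hypothetical helper)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_fingerprints(items, k=400):
    """Return k fingerprints: the minimum hash of a non-empty set under each of k functions."""
    return [min(_hash(seed, it) for it in items) for seed in range(k)]


def estimate_similarity(fp_a, fp_b):
    """Fraction of positions where two equal-length fingerprint lists agree."""
    matches = sum(a == b for a, b in zip(fp_a, fp_b))
    return matches / len(fp_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = shingles("the quick brown fox jumps over the lazy dog near the river bank")
    doc_b = shingles("the quick brown fox jumps over the lazy cat near the river bank")
    fp_a = minhash_fingerprints(doc_a)
    fp_b = minhash_fingerprints(doc_b)
    r = 0.7  # similarity threshold, as in the abstract's experiment
    est = estimate_similarity(fp_a, fp_b)
    print("exact:", jaccard(doc_a, doc_b), "estimate:", est, "near-duplicate:", est >= r)
```

In this sketch a pair would be reported as near-duplicate when the estimate reaches the threshold r. Comparing every pair of fingerprint lists this way is the brute-force baseline; the retrieval algorithm described in the abstract is aimed at reducing that cost.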
Pages: 2231-2237
Number of pages: 7
Source
Journal of Central South University, 2012, 19 (08): 2231-2237