Pivot-Based Similarity Wide-Joins Fostering Near-Duplicate Detection

Times cited: 0
Authors
Carvalho, Luiz Olmes [1, 2]
Dutra Santos, Lucio Fernandes [1, 3]
Machado Traina, Agma Juci [1]
Traina, Caetano, Jr. [1]
Affiliations
[1] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, SP, Brazil
[2] Fed Inst Minas Gerais, Belo Horizonte, MG, Brazil
[3] Fed Inst North Minas Gerais, Montes Claros, MG, Brazil
Source
ENTERPRISE INFORMATION SYSTEMS, ICEIS 2016 | 2017 / Vol. 291
Funding
São Paulo Research Foundation, Brazil;
Keywords
Similarity search; Similarity join; Query operators; Wide-join; Near-duplicate detection;
DOI
10.1007/978-3-319-62386-3_4
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Monitoring systems that aim to improve decision making in emergency scenarios currently benefit from crowdsourced information. The main issue with this kind of data is that the gathered reports quickly become too similar to one another. Such overly similar reports, namely near-duplicates, add no valuable knowledge to assist crisis control committees in their decision-making tasks. Current approaches to detecting near-duplicates are usually based on two-phase processing: the first phase relies on similarity queries or clustering techniques, and the second, most computationally costly phase refines the result of the first. Aiming to reduce that cost and to improve near-duplicate detection, we developed a framework model based on the similarity wide-join database operator. This paper extends the wide-join definition so that it overcomes its restrictions, and provides an efficient pivot-based algorithm that speeds up the entire process while retrieving the most similar elements in a single pass. We also investigate alternatives and propose efficient algorithms to choose the pivots. Experiments using real datasets show that our framework is up to three orders of magnitude faster than the competing techniques in the literature, while also improving the quality of the result by about 35%.
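The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea behind pivot-based filtering in a similarity join, not the authors' method. It assumes a metric distance (Euclidean here), a self range join with a hypothetical threshold `radius`, and reads "wide-join" loosely as "keep only the `k` closest pairs of the join result". The function name `pivot_wide_join`, its parameters, and the sample data are all illustrative assumptions.

```python
import math
from itertools import combinations


def pivot_wide_join(points, pivots, radius, k, dist=math.dist):
    """Self range join with pivot filtering, keeping the k closest pairs.

    Illustrative sketch only: 'radius', 'k', and the choice of pivots
    are assumptions, not values taken from the paper.
    Returns (closest_pairs, number_of_real_distance_computations).
    """
    # Precompute the distance from every element to every pivot once.
    table = [[dist(p, v) for v in pivots] for p in points]

    candidates = []
    computed = 0
    for i, j in combinations(range(len(points)), 2):
        # Triangle inequality: |d(a, v) - d(b, v)| <= d(a, b) for any pivot v,
        # so the largest such gap is a lower bound on the true distance.
        lower = max(
            (abs(x - y) for x, y in zip(table[i], table[j])), default=0.0
        )
        if lower > radius:
            continue  # pruned without computing the real distance
        d = dist(points[i], points[j])
        computed += 1
        if d <= radius:
            candidates.append((d, i, j))

    # Keep only the k most similar pairs, closest first.
    candidates.sort()
    return candidates[:k], computed


# Hypothetical data: two tight groups and one isolated element.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (10.0, 0.0)]
pivots = [(0.0, 0.0), (10.0, 0.0)]
pairs, computed = pivot_wide_join(points, pivots, radius=0.5, k=2)
print([(i, j) for _, i, j in pairs], computed)
```

In this toy run only two of the ten candidate pairs need a real distance computation; the rest are discarded by the pivot lower bound. How much pruning is achieved in practice depends heavily on how the pivots are chosen, which is the aspect the abstract says the paper investigates.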
Pages: 81-104
Page count: 24
Related papers
49 in total
  • [1] Efficient Self-similarity Range Wide-joins Fostering Near-duplicate Image Detection in Emergency Scenarios
    Carvalho, Luiz Olmes
    Santos, Lucio F. D.
    Oliveira, Willian D.
    Traina, Agma J. M.
    Traina, Caetano, Jr.
    PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS, VOL 1 (ICEIS), 2016, : 81 - 91
  • [2] Self Similarity Wide-Joins for Near-Duplicate Image Detection
    Carvalho, Luiz Olmes
    Santos, Lucio F. D.
    Oliveira, Willian D.
    Traina, Agma J. M.
    Traina, Caetano, Jr.
    2015 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), 2015, : 237 - 240
  • [3] Efficient Similarity Joins for Near-Duplicate Detection
    Xiao, Chuan
    Wang, Wei
    Lin, Xuemin
    Yu, Jeffrey Xu
    Wang, Guoren
    ACM TRANSACTIONS ON DATABASE SYSTEMS, 2011, 36 (03):
  • [4] Adaptive Near-Duplicate Detection via Similarity Learning
    Hajishirzi, Hannaneh
    Yih, Wen-tau
    Kolcz, Aleksander
    SIGIR 2010: PROCEEDINGS OF THE 33RD ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH DEVELOPMENT IN INFORMATION RETRIEVAL, 2010, : 419 - 426
  • [5] Sectional MinHash for near-duplicate detection
    Hassanian-esfahani, Roya
    Kargar, Mohammad-javad
    EXPERT SYSTEMS WITH APPLICATIONS, 2018, 99 : 203 - 212
  • [6] On Efficient Content-based Near-duplicate Video Detection
    Uysal, Merih Seran
    Beecks, Christian
    Seidl, Thomas
    2015 13TH INTERNATIONAL WORKSHOP ON CONTENT-BASED MULTIMEDIA INDEXING (CBMI), 2015,
  • [7] Online Near-Duplicate Detection of News Articles
    Rodier, Simon
    Carter, Dave
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 1242 - 1249
  • [8] Video Query Reformulation for Near-Duplicate Detection
    Chiu, Chih-Yi
    Li, Sheng-Yang
    Hsieh, Cheng-Yu
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2012, 7 (05) : 1594 - 1603
  • [9] Analysis of Neural Codes for Near-Duplicate Detection
    Pintus, Maurizio
    ADVANCED CONCEPTS FOR INTELLIGENT VISION SYSTEMS, ACIVS 2018, 2018, 11182 : 357 - 368
  • [10] Benchmarking unsupervised near-duplicate image detection
    Morra, Lia
    Lamberti, Fabrizio
    EXPERT SYSTEMS WITH APPLICATIONS, 2019, 135 : 313 - 326