Pivot-Based Similarity Wide-Joins Fostering Near-Duplicate Detection

Times cited: 0
Authors
Carvalho, Luiz Olmes [1 ,2 ]
Dutra Santos, Lucio Fernandes [1 ,3 ]
Machado Traina, Agma Juci [1 ]
Traina, Caetano, Jr. [1 ]
Affiliations
[1] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, SP, Brazil
[2] Fed Inst Minas Gerais, Belo Horizonte, MG, Brazil
[3] Fed Inst North Minas Gerais, Montes Claros, MG, Brazil
Source
ENTERPRISE INFORMATION SYSTEMS, ICEIS 2016 | 2017 / Vol. 291
Funding
Sao Paulo Research Foundation (FAPESP), Brazil;
Keywords
Similarity search; Similarity join; Query operators; Wide-join; Near-duplicate detection;
DOI
10.1007/978-3-319-62386-3_4
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
Monitoring systems that aim to improve decision making in emergency scenarios currently benefit from crowdsourced information. The main issue with this kind of data is that the gathered reports quickly become too similar to one another. Such overly similar reports, namely near-duplicates, add no valuable knowledge to assist crisis control committees in their decision-making tasks. Current approaches to detecting near-duplicates usually rely on two-phase processing: the first phase uses similarity queries or clustering techniques, and the second, most computationally costly phase refines the result of the first. To reduce that cost and improve near-duplicate detection, we developed a framework based on the similarity wide-join database operator. This paper extends the wide-join definition to overcome its restrictions and provides an efficient pivot-based algorithm that speeds up the entire process while retrieving the most similar elements in a single pass. We also investigate alternatives and propose efficient algorithms for choosing the pivots. Experiments using real datasets show that our framework is up to three orders of magnitude faster than competing techniques in the literature, while also improving the quality of the result by about 35%.
Pages: 81 - 104
Page count: 24
Related papers
49 in total
  • [21] An extended version of sectional MinHash method for near-duplicate detection
    Shayegan, Mohammad-Javad
    Faizollahi-Samarin, Mehdi
    JOURNAL OF SUPERCOMPUTING, 2022, 78 (13) : 15638 - 15662
  • [22] An extended version of sectional MinHash method for near-duplicate detection
    Mohammad-Javad Shayegan
    Mehdi Faizollahi-Samarin
    The Journal of Supercomputing, 2022, 78 : 15638 - 15662
  • [23] Near-Duplicate Image Detection in a Visually Salient Riemannian Space
    Zheng, Ligang
    Lei, Yanqiang
    Qiu, Guoping
    Huang, Jiwu
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2012, 7 (05) : 1578 - 1593
  • [24] TOWARDS USING SEMANTIC FEATURES FOR NEAR-DUPLICATE VIDEO DETECTION
    Min, Hyun-seok
    De Neve, Wesley
    Ro, Yong Man
    2010 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME 2010), 2010, : 1364 - 1369
  • [25] EFFICIENT NEAR-DUPLICATE IMAGE DETECTION BY LEARNING FROM EXAMPLES
    Hu, Yang
    Li, Mingjing
    Yu, Nenghai
    2008 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-4, 2008, : 657 - +
  • [26] Domain-Specific Keyphrase Extraction and Near-Duplicate Article Detection based on Ontology
    Nhon Do
    LongVan Ho
    2015 IEEE RIVF INTERNATIONAL CONFERENCE ON COMPUTING & COMMUNICATION TECHNOLOGIES - RESEARCH, INNOVATION, AND VISION FOR THE FUTURE (RIVF), 2015, : 123 - 126
  • [27] Filtering Image Spam using Image Semantics and Near-Duplicate Detection
    Qu, Zhaoyang
    Zhang, Yingjin
    ICICTA: 2009 SECOND INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTATION TECHNOLOGY AND AUTOMATION, VOL I, PROCEEDINGS, 2009, : 600 - 603
  • [28] Apollo: Near-Duplicate Detection for Job Ads in the Online Recruitment Domain
    Burk, Hunter
    Javed, Faizan
    Balaji, Janani
    2017 17TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2017), 2017, : 177 - 182
  • [29] Near-Duplicate Web Page Detection by Enhanced TDW and simHash Technique
    Arun, P. R.
    Sumesh, M. S.
    2015 INTERNATIONAL CONFERENCE ON COMPUTING AND NETWORK COMMUNICATIONS (COCONET), 2015, : 765 - 770
  • [30] Scale-Rotation Invariant Pattern Entropy for Keypoint-Based Near-Duplicate Detection
    Zhao, Wan-Lei
    Ngo, Chong-Wah
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2009, 18 (02) : 412 - 423