Pivot-Based Similarity Wide-Joins Fostering Near-Duplicate Detection

Times cited: 0
Authors
Carvalho, Luiz Olmes [1, 2]
Dutra Santos, Lucio Fernandes [1, 3]
Machado Traina, Agma Juci [1]
Traina, Caetano, Jr. [1]
Affiliations
[1] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, SP, Brazil
[2] Fed Inst Minas Gerais, Belo Horizonte, MG, Brazil
[3] Fed Inst North Minas Gerais, Montes Claros, MG, Brazil
Source
ENTERPRISE INFORMATION SYSTEMS, ICEIS 2016 | 2017 / Vol. 291
Funding
São Paulo Research Foundation (FAPESP), Brazil
Keywords
Similarity search; Similarity join; Query operators; Wide-join; Near-duplicate detection;
DOI
10.1007/978-3-319-62386-3_4
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Monitoring systems that aim to improve decision making in emergency scenarios currently benefit from crowdsourced information. The main issue with this kind of data is that the gathered reports quickly become too similar to one another. Such overly similar reports, namely near-duplicates, add no valuable knowledge to assist crisis control committees in their decision-making tasks. Current approaches to detecting near-duplicates are usually based on two-phase processing: the first phase relies on similarity queries or clustering techniques, and the second, most computationally costly phase refines the result of the first. Aiming to reduce that cost and to improve near-duplicate detection, we developed a framework model based on the similarity wide-join database operator. This paper extends the wide-join definition to overcome its restrictions and provides an efficient pivot-based algorithm that speeds up the entire process while retrieving the most similar elements in a single pass. We also investigate alternatives and propose efficient algorithms for choosing the pivots. Experiments using real datasets show that our framework is up to three orders of magnitude faster than competing techniques in the literature, while also improving the quality of the result by about 35%.
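The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of pivot-based filtering in a similarity join, not the authors' method. It assumes a metric distance (Euclidean here), a range threshold, and a cap of k result pairs. The names `pivot_join`, `radius`, and `k` are hypothetical, introduced for illustration. Pivots give a cheap lower bound on each pair's distance through the triangle inequality, so many pairs can be discarded without computing their real distance.

```python
import heapq
import math


def pivot_join(left, right, pivots, radius, k, dist=math.dist):
    """Illustrative only: k closest (distance, i, j) pairs within `radius`.

    `pivot_join`, `radius` and `k` are hypothetical names, not taken
    from the paper. `dist` must be a metric for the pruning to be safe.
    """
    # Precompute the distance from every element to every pivot.
    left_to_pivot = [[dist(x, p) for p in pivots] for x in left]
    right_to_pivot = [[dist(y, p) for p in pivots] for y in right]

    heap = []      # max-heap through negated distances, at most k entries
    computed = 0   # how many real distance evaluations were needed

    for i, x in enumerate(left):
        for j, y in enumerate(right):
            # Triangle inequality: |d(x,p) - d(y,p)| <= d(x,y) for any pivot p.
            lower = max(abs(a - b)
                        for a, b in zip(left_to_pivot[i], right_to_pivot[j]))
            if lower > radius:
                continue  # the pair cannot qualify, skip the real distance
            d = dist(x, y)
            computed += 1
            if d <= radius:
                heapq.heappush(heap, (-d, i, j))
                if len(heap) > k:
                    heapq.heappop(heap)  # drop the farthest pair kept so far

    # Return pairs ordered from most to least similar.
    return sorted((-nd, i, j) for nd, i, j in heap), computed


if __name__ == "__main__":
    reports = [(0.0, 0.0), (5.0, 5.0)]
    others = [(0.0, 1.0), (5.0, 5.5), (10.0, 10.0)]
    pairs, evaluations = pivot_join(reports, others, [(0.0, 0.0)], 1.5, 2)
    print(pairs, evaluations)
```

In this toy run only two of the six candidate pairs need a real distance computation; the other four are ruled out by the pivot bound alone. How the pivots are chosen strongly affects how tight that bound is, which is the question the abstract says the paper investigates.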
Pages: 81-104
Page count: 24