Pivot-Based Similarity Wide-Joins Fostering Near-Duplicate Detection

Cited by: 0
Authors
Carvalho, Luiz Olmes [1 ,2 ]
Dutra Santos, Lucio Fernandes [1 ,3 ]
Machado Traina, Agma Juci [1 ]
Traina, Caetano, Jr. [1 ]
Affiliations
[1] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, SP, Brazil
[2] Fed Inst Minas Gerais, Belo Horizonte, MG, Brazil
[3] Fed Inst North Minas Gerais, Montes Claros, MG, Brazil
Source
ENTERPRISE INFORMATION SYSTEMS, ICEIS 2016 | 2017 / Vol. 291
Funding
São Paulo Research Foundation, Brazil;
Keywords
Similarity search; Similarity join; Query operators; Wide-join; Near-duplicate detection;
DOI
10.1007/978-3-319-62386-3_4
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Monitoring systems that aim to improve decision making in emergency scenarios currently benefit from crowdsourced information. The main issue with this kind of data is that the gathered reports quickly become too similar to one another. Such overly similar reports, namely near-duplicates, add no valuable knowledge to assist crisis control committees in their decision-making tasks. Current approaches to detecting near-duplicates usually rely on two-phase processing: the first phase uses similarity queries or clustering techniques, and the second, most computationally costly phase refines the result of the first. Aiming to reduce that cost and to improve near-duplicate detection, we developed a framework model based on the similarity wide-join database operator. This paper extends the wide-join definition so that it overcomes its restrictions, and provides an efficient pivot-based algorithm that speeds up the entire process while retrieving the most similar elements in a single pass. We also investigate alternatives and propose efficient algorithms for choosing the pivots. Experiments using real datasets show that our framework is up to three orders of magnitude faster than competing techniques in the literature, while also improving the quality of the result by about 35%.
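The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea behind pivot-based filtering in a similarity join, not the authors' wide-join operator. It assumes a metric distance (Euclidean here), a hand-picked pivot set, and a join that keeps the k closest pairs overall. The names `pivot_table`, `lower_bound`, and `top_k_join` are hypothetical. Distances from every element to each pivot are precomputed once; the triangle inequality then gives a cheap lower bound on any pair's distance, so pairs that cannot enter the current top-k are skipped without computing their real distance.

```python
import heapq
import math


def pivot_table(data, pivots, dist):
    """Precompute the distance from every element to every pivot."""
    return [[dist(x, p) for p in pivots] for x in data]


def lower_bound(row_a, row_b):
    """Triangle inequality: |d(a,p) - d(b,p)| <= d(a,b) for every pivot p."""
    return max(abs(da - db) for da, db in zip(row_a, row_b))


def top_k_join(left, right, pivots, k, dist=math.dist):
    """Return the k closest (distance, i, j) pairs and the number of
    real distance computations performed (illustrative assumption only)."""
    tl = pivot_table(left, pivots, dist)
    tr = pivot_table(right, pivots, dist)
    heap = []      # max-heap emulated with negated distances
    computed = 0   # real distance evaluations between left and right
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            # Skip the pair if its lower bound cannot beat the k-th best.
            if len(heap) == k and lower_bound(tl[i], tr[j]) >= -heap[0][0]:
                continue
            d = dist(a, b)
            computed += 1
            if len(heap) < k:
                heapq.heappush(heap, (-d, i, j))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, i, j))
    return sorted((-nd, i, j) for nd, i, j in heap), computed


left = [(0, 0), (1, 1), (5, 5), (10, 10)]
right = [(0, 0.1), (5, 5.2), (9, 9), (20, 20)]
pivots = [(0, 0), (10, 0)]  # hand-picked here; pivot choice is an assumption
pairs, computed = top_k_join(left, right, pivots, k=2)
```

In this sketch the result is obtained in one pass over the candidate pairs, and `computed` counts how many real distance evaluations survived the pivot filter. How many pairs get pruned depends heavily on where the pivots lie relative to the data.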
Pages: 81-104
Page count: 24