Search for Near-Duplicate Handwritten Documents for Data-Intensive Applications

Cited by: 0

Authors:

Varlamova, K. D. [1,2]

Kaprielova, M. S. [1,2,3]

Potyashin, I. O. [1,2]

Chekhovich, Yu. V. [1]

Affiliations:

[1] Antiplagiat Co, Moscow, Russia

[2] Moscow Inst Phys & Technol, Dolgoprudnyi 141701, Moscow Oblast, Russia

[3] Russian Acad Sci, Fed Res Ctr Comp Sci & Control, Moscow 119333, Russia

Source:

JOURNAL OF COMPUTER AND SYSTEMS SCIENCES INTERNATIONAL | 2024 / Vol. 63 / No. 4

Keywords:

computer vision; near-duplicate detection; handwritten document analysis; large collections; Russian cursive; plagiarism; features

DOI:

10.1134/S1064230724700503

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The problem of cheating in handwritten academic essays has become more significant over the past few years. One type of cheating involves submitting the same paper, photographed in a different environment (for example, from another angle, in a different light, or in lower quality) or changed by automatic augmentation. The existing methods for detecting near-duplicates are not designed to work on large collections of handwritten documents, which significantly limits their use in practice. A machine learning-based method is presented that enables the detection of near-duplicate handwritten text images among large collections of potential sources. The proposed approach consists of three stages: converting the image into a vector representation, searching for candidates, and then selecting the source of duplication among the candidates. Our method achieved 80% and 59% recall-at-1 with false positive rates of 4.8% and 5.5% on Synthetic and Real data, respectively. The search latency is 5.5 seconds per query for a collection of 10,000 images. The results showed that the developed method is sufficiently robust to solve problems that require checking large collections of handwritten documents for cheating.
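The three stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the embedding model, the index structure, or the selection rule. The sketch assumes that each image has already been reduced to a numeric feature vector, that candidates are found by brute-force cosine similarity, and that the source is chosen by a fixed similarity threshold. All function names and the default values of `k` and `threshold` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]


def embed(image_features: Sequence[float]) -> Vector:
    """Stage 1 (stand-in): map an image to a unit-length vector.

    Assumption: a real system would run a learned model on pixels here;
    this sketch only L2-normalizes features that are already numeric.
    """
    norm = math.sqrt(sum(v * v for v in image_features))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in image_features]


def search_candidates(
    query: Vector, index: Dict[str, Vector], k: int
) -> List[Tuple[str, float]]:
    """Stage 2: return the k most similar indexed documents.

    Vectors are unit-length, so the dot product equals cosine similarity.
    Brute force is an assumption; a large collection would likely need
    an approximate nearest-neighbor index instead.
    """
    scored = [
        (doc_id, sum(q * d for q, d in zip(query, vec)))
        for doc_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def select_source(
    candidates: List[Tuple[str, float]], threshold: float
) -> Optional[str]:
    """Stage 3: accept the best candidate only if it is similar enough.

    Returning None means "no duplicate found", which is what keeps the
    false positive rate in check for original submissions.
    """
    if not candidates:
        return None
    best_id, best_score = candidates[0]
    return best_id if best_score >= threshold else None


def find_near_duplicate(
    image_features: Sequence[float],
    index: Dict[str, Vector],
    k: int = 5,
    threshold: float = 0.9,
) -> Optional[str]:
    """Run the three stages in order for a single query image."""
    query = embed(image_features)
    return select_source(search_candidates(query, index, k), threshold)


# Toy collection of three "documents" with hand-made feature vectors.
collection = {
    "essay_a": embed([1.0, 0.0, 0.0]),
    "essay_b": embed([0.0, 1.0, 0.0]),
    "essay_c": embed([0.9, 0.1, 0.0]),
}
rephotographed = find_near_duplicate([1.0, 0.05, 0.0], collection)
original_work = find_near_duplicate([0.0, 0.0, 1.0], collection)
```

In this toy run the slightly perturbed query is matched to `essay_a`, while the unrelated query yields `None`. Under this framing, recall-at-1 would be the share of true duplicates whose top-ranked candidate is the correct source, and the false positive rate the share of original documents for which a source is nevertheless returned.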


Pages: 687-694

Page count: 8
