Search for Near-Duplicate Handwritten Documents for Data-Intensive Applications

Cited by: 0
Authors
Varlamova, K. D. [1, 2]
Kaprielova, M. S. [1, 2, 3]
Potyashin, I. O. [1, 2]
Chekhovich, Yu. V. [1]
Affiliations
[1] Antiplagiat Co, Moscow, Russia
[2] Moscow Inst Phys & Technol, Dolgoprudnyi 141701, Moscow Oblast, Russia
[3] Russian Acad Sci, Fed Res Ctr Comp Sci & Control, Moscow 119333, Russia
Keywords
computer vision; near-duplicate detection; handwritten document analysis; large collections; Russian cursive; plagiarism; features
DOI
10.1134/S1064230724700503
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The problem of cheating in handwritten academic essays has become more significant over the past few years. One type of cheating involves submitting the same paper, photographed in a different environment (for example, from another angle, in a different light, or in lower quality) or changed by automatic augmentation. The existing methods for detecting near-duplicates are not designed to work on large collections of handwritten documents, which significantly limits their use in practice. A machine learning-based method is presented that enables the detection of near-duplicate handwritten text images among large collections of potential sources. The proposed approach consists of three stages: converting the image into a vector representation, searching for candidates, and then selecting the source of duplication among the candidates. Our method achieved 80% and 59% recall-at-1 with false positive rates of 4.8% and 5.5% on Synthetic and Real data, respectively. The search latency is 5.5 seconds per query for a collection of 10 000 images. The results showed that the developed method is sufficiently robust to solve problems that require checking large collections of handwritten documents for cheating.
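The three stages named in the abstract (vector representation, candidate search, source selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract describes a machine learning-based representation, whereas the `embed` function here is a simple average-pooling stand-in, and the names `embed`, `find_candidates`, `select_source`, the value `k=5`, and the threshold `0.9` are all hypothetical choices made for illustration.

```python
import numpy as np


def embed(image: np.ndarray, grid: int = 8) -> np.ndarray:
    """Stage 1 (stand-in, NOT the paper's learned model): average-pool a
    grayscale image into a grid x grid map, center it, and L2-normalize."""
    h, w = image.shape
    # Crop so that height and width divide evenly into grid cells.
    image = image[: h - h % grid, : w - w % grid].astype(np.float64)
    cells = image.reshape(grid, image.shape[0] // grid, grid, image.shape[1] // grid)
    vec = cells.mean(axis=(1, 3)).ravel()
    vec = vec - vec.mean()  # removes a global brightness offset
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def find_candidates(query_vec: np.ndarray, index: np.ndarray, k: int = 5):
    """Stage 2: return the k most similar collection items as (id, score).
    Rows of `index` are unit vectors, so the dot product is cosine similarity."""
    scores = index @ query_vec
    top = np.argsort(-scores)[: min(k, len(scores))]
    return [(int(i), float(scores[i])) for i in top]


def select_source(candidates, threshold: float = 0.9):
    """Stage 3: report the best candidate as the source of duplication only
    if its score clears the (hypothetical) threshold; otherwise None."""
    if not candidates:
        return None
    best_id, best_score = max(candidates, key=lambda c: c[1])
    return best_id if best_score >= threshold else None


# Synthetic demonstration: random "pages" stand in for document images.
rng = np.random.default_rng(0)
collection = [rng.random((64, 48)) for _ in range(100)]
index = np.stack([embed(img) for img in collection])

# A near-duplicate of item 42: lower contrast, brighter, slightly noisy.
near_dup = 0.7 * collection[42] + 0.1 + rng.normal(0.0, 0.02, (64, 48))
unrelated = rng.random((64, 48))

found = select_source(find_candidates(embed(near_dup), index))
not_found = select_source(find_candidates(embed(unrelated), index))
```

The threshold in the last stage is what trades recall against false positives: lowering it finds more true sources but also flags more unrelated submissions, which is the trade-off the reported recall-at-1 and false positive rate figures describe.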
Pages: 687-694
Page count: 8