Improving OCR for an Under-Resourced Script Using Unsupervised Word-Spotting

Authors
Silberpfennig, Adi [1]
Wolf, Lior [1]
Dershowitz, Nachum [1]
Bhagesh, Seraogi [2]
Chaudhuri, Bidyut B. [2]
Affiliations
[1] Tel Aviv Univ, Blavatnik Sch Comp Sci, Tel Aviv, Israel
[2] Indian Stat Inst, Kolkata, India
Source
2015 13TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR) | 2015
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Optical character recognition (OCR) quality, especially for under-resourced scripts like Bangla, as well as for documents printed in old typefaces, is a major concern. An efficient and effective pipeline for OCR betterment is proposed here. The method is unsupervised. It employs a baseline OCR engine as a black box plus a dataset of unlabeled document images. That engine is applied to the images, followed by a visual encoding designed to support efficient word spotting. Given a new document to be analyzed, the black-box recognition engine is first applied. Then, for each result, word spotting is carried out within the dataset. The unreliable OCR outputs of the retrieved word spotting results are then considered. The word that is the centroid of the set of OCR words, measured by edit distance, is deemed a candidate reading.
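The final step of the abstract, choosing the centroid of a set of OCR words under edit distance, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "centroid" means the set member with the smallest total Levenshtein distance to the other members (a medoid), and that word spotting has already returned the noisy OCR strings. The names edit_distance and centroid_reading, and the sample words, are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def centroid_reading(ocr_words: Sequence[str]) -> str:
    """Return the word with the smallest summed edit distance to all words.

    Assumption: the centroid is restricted to members of the set (a medoid).
    Ties are resolved in favor of the earliest word in the sequence.
    """
    if not ocr_words:
        raise ValueError("need at least one OCR word")
    return min(
        ocr_words,
        key=lambda w: sum(edit_distance(w, other) for other in ocr_words),
    )


# Hypothetical OCR outputs for the word images retrieved by word spotting.
retrieved = ["recogniton", "recognition", "recognition",
             "rec0gnition", "recognitlon"]
candidate = centroid_reading(retrieved)
```

In this toy input the errors differ from one retrieved instance to the next, so the string closest to all the others is the correctly spelled one. Whether the paper weights or filters the retrieved words before taking the centroid is not stated in the abstract.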
Pages: 706-710
Page count: 5