Improving OCR for an Under-Resourced Script Using Unsupervised Word-Spotting

Cited by: 0
Authors
Silberpfennig, Adi [1]
Wolf, Lior [1]
Dershowitz, Nachum [1]
Bhagesh, Seraogi [2]
Chaudhuri, Bidyut B. [2]
Affiliations
[1] Tel Aviv Univ, Blavatnik Sch Comp Sci, Tel Aviv, Israel
[2] Indian Stat Inst, Kolkata, India
Source
2015 13TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR) | 2015
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Optical character recognition (OCR) quality remains a major concern, especially for under-resourced scripts such as Bangla and for documents printed in old typefaces. This paper proposes an efficient and effective pipeline for improving OCR output. The method is unsupervised: it uses a baseline OCR engine as a black box, together with a dataset of unlabeled document images. The engine is applied to these images, which are then given a visual encoding designed to support efficient word spotting. Given a new document to be analyzed, the black-box recognition engine is applied first. Then, for each resulting word, word spotting is carried out within the dataset, and the (unreliable) OCR outputs of the retrieved word-spotting results are considered. The word that is the centroid of this set of OCR words, as measured by edit distance, is deemed a candidate reading.
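The final step of the abstract, choosing the centroid of a set of OCR words under edit distance, can be illustrated with a small sketch. This is not the authors' implementation: the function names `edit_distance` and `centroid_word` are hypothetical, the distance is assumed to be plain Levenshtein distance, and the centroid is assumed to be the set member with the smallest total distance to all others (a medoid). The abstract does not say how ties are broken or whether the query word's own OCR output is part of the set; here ties go to the earliest word in the input.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (unit-cost insert/delete/substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def centroid_word(words: Sequence[str]) -> str:
    """Return the word with the smallest summed edit distance to all words.

    Assumption: the "centroid" is taken to be a member of the set (a medoid).
    Ties are resolved in favour of the earliest word in the input.
    """
    if not words:
        raise ValueError("need at least one OCR output")
    return min(words, key=lambda w: sum(edit_distance(w, o) for o in words))


# Hypothetical OCR outputs for word images retrieved by word spotting.
retrieved = ["recogniton", "recognition", "recognition", "rec0gnition"]
candidate = centroid_word(retrieved)
```

Because the search is quadratic in the number of retrieved words, a sketch like this is only practical when word spotting returns a short list of matches per query.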
Pages: 706-710
Page count: 5