Improving OCR for an Under-Resourced Script Using Unsupervised Word-Spotting

Cited by: 0
Authors
Silberpfennig, Adi [1]
Wolf, Lior [1]
Dershowitz, Nachum [1]
Bhagesh, Seraogi [2]
Chaudhuri, Bidyut B. [2]
Affiliations
[1] Tel Aviv Univ, Blavatnik Sch Comp Sci, Tel Aviv, Israel
[2] Indian Stat Inst, Kolkata, India
Source
2015 13TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR) | 2015
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Optical character recognition (OCR) quality, especially for under-resourced scripts like Bangla, as well as for documents printed in old typefaces, is a major concern. An efficient and effective pipeline for OCR betterment is proposed here. The method is unsupervised. It employs a baseline OCR engine as a black box plus a dataset of unlabeled document images. That engine is applied to the images, followed by a visual encoding designed to support efficient word spotting. Given a new document to be analyzed, the black-box recognition engine is first applied. Then, for each result, word spotting is carried out within the dataset. The unreliable OCR outputs of the retrieved word spotting results are then considered. The word that is the centroid of the set of OCR words, measured by edit distance, is deemed a candidate reading.
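The final step of the abstract, choosing the centroid of a set of OCR readings under edit distance, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes plain unit-cost Levenshtein distance, takes the "centroid" to be the reading with the smallest total distance to all the others, and breaks ties by first occurrence; the abstract specifies none of these details. The function names `edit_distance` and `centroid_word` are hypothetical, and the sample readings are made up for illustration.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between two strings (assumed metric)."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def centroid_word(readings: Sequence[str]) -> str:
    """Return the reading with the smallest summed edit distance to all readings.

    Ties go to the earliest reading in the sequence (an assumption of this sketch).
    """
    if not readings:
        raise ValueError("at least one OCR reading is required")
    return min(readings, key=lambda w: sum(edit_distance(w, other) for other in readings))


# Hypothetical OCR outputs for word images retrieved by word spotting.
candidates = ["bangla", "bangia", "bangla", "banqla"]
best = centroid_word(candidates)
```

Note that Python strings are compared here code point by code point. For a script such as Bangla, where one visible character can span several code points, a real system might first segment the readings into larger units; that choice is outside what the abstract states.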
Pages: 706-710
Page count: 5