Improving OCR for an Under-Resourced Script Using Unsupervised Word-Spotting

Cited by: 0

Authors:

Silberpfennig, Adi [1]

Wolf, Lior [1]

Dershowitz, Nachum [1]

Bhagesh, Seraogi [2]

Chaudhuri, Bidyut B. [2]

Affiliations:

[1] Tel Aviv Univ, Blavatnik Sch Comp Sci, Tel Aviv, Israel

[2] Indian Stat Inst, Kolkata, India

Source:

2015 13TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR) | 2015

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Optical character recognition (OCR) quality, especially for under-resourced scripts like Bangla, as well as for documents printed in old typefaces, is a major concern. An efficient and effective pipeline for OCR betterment is proposed here. The method is unsupervised. It employs a baseline OCR engine as a black box plus a dataset of unlabeled document images. That engine is applied to the images, followed by a visual encoding designed to support efficient word spotting. Given a new document to be analyzed, the black-box recognition engine is first applied. Then, for each result, word spotting is carried out within the dataset. The unreliable OCR outputs of the retrieved word spotting results are then considered. The word that is the centroid of the set of OCR words, measured by edit distance, is deemed a candidate reading.
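The final step of the abstract, choosing the centroid of a set of OCR words under edit distance, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes plain Levenshtein distance, takes the centroid to be the set member with the smallest summed distance to all members (a medoid), and breaks ties by first occurrence. The function names `edit_distance` and `centroid_reading` and the sample readings are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def centroid_reading(ocr_words: Sequence[str]) -> str:
    """Return the member with the smallest summed edit distance to all members.

    Assumption: ties are broken by first occurrence in the input.
    """
    if not ocr_words:
        raise ValueError("need at least one OCR reading")
    return min(ocr_words,
               key=lambda w: sum(edit_distance(w, o) for o in ocr_words))


# Hypothetical noisy OCR outputs for word images retrieved by word spotting.
readings = ["recogniton", "recognition", "recognition",
            "rec0gnition", "reoognition"]
candidate = centroid_reading(readings)
```

In this sketch the candidate is always one of the observed OCR strings, so a correct reading can only be recovered if at least one retrieved word image was recognized correctly.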


Pages: 706-710

Page count: 5
