Multi-font printed Mongolian document recognition system

Cited by: 22

Authors:

Peng, Liangrui [1,2,3]

Liu, Changsong [1,2,3]

Ding, Xiaoqing [1,2,3]

Jin, Jianming [4]

Wu, Youshou [1,2,3]

Wang, Hua [1,2,3]

Bao, Yanhua [5]

Affiliations:

[1] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China

[2] Tsinghua Univ, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China

[3] Tsinghua Univ, State Key Lab Intelligent Technol & Syst, Beijing 100084, Peoples R China

[4] HP Labs China, Beijing 100084, Peoples R China

[5] Hulunbeier Coll, Mongolian Dept, Hailar 021008, Inner Mongolia, Peoples R China

Source:

INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION | 2010 / Vol. 13 / Issue 02

Keywords:

Multi-font Mongolian; Character recognition; Character segmentation; Mixed script

DOI:

10.1007/s10032-009-0106-8

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Mongolian is one of the most common written languages in China, Mongolia, and Russia. Many printed Mongolian documents still remain to be digitized for digital library applications. The traditional Mongolian script has a unique vertical cursive writing style and multiple font variations, which makes Mongolian Optical Character Recognition challenging. As the traditional Mongolian script has subcomponent characteristics, such that one character may be a constituent of another character, in this work we define a novel character set for recognition using segmented components. The components are combined into characters in a rule-based post-processing module. For overall character recognition, a method based on Visual Directional Features and multi-level classifiers is presented. For character segmentation, segmentation points are identified by analyzing the properties of projection profiles and connected components. Mongolian has dozens of different printed font types that can be categorized into two major groups, namely, standard and handwritten-style groups. The segmentation parameters are adjusted for each group. Additionally, script identification and relevant character recognition kernels are integrated for the recognition of Mongolian text mixed with Chinese and English. A novel multi-font printed Mongolian document recognition system based on the proposed methods is implemented. Experiments indicate a text recognition rate of 96.9% on the test samples from real documents with multiple font types and mixed script. The proposed methods can also be applied to other scripts in the Mongolian script family, such as Todo and Sibe, with significant potential for extension to historic Mongolian documents.
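The abstract mentions that segmentation points are found by analyzing projection profiles. The following is only a minimal sketch of that general idea, not the authors' algorithm: it assumes a binarized image of one vertically written word (a list of rows, 1 = ink), and it assumes that candidate cut rows are those where little more than the connecting vertical stroke remains. The function names and the `max_ink` threshold are hypothetical, chosen for illustration; the connected-component analysis and per-font-group parameter tuning described in the abstract are not modeled.

```python
from typing import List


def row_projection(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def candidate_cuts(image: List[List[int]], max_ink: int = 1) -> List[int]:
    """Return one candidate cut row per run of 'thin' rows.

    A row is 'thin' when it contains ink but no more than max_ink pixels,
    i.e. (by assumption) only the connecting stroke. max_ink is a
    hypothetical parameter, not a value taken from the paper.
    """
    profile = row_projection(image)
    cuts: List[int] = []
    run_start = None
    # A sentinel value above the threshold closes a run that reaches the last row.
    for y, ink in enumerate(profile + [max_ink + 1]):
        thin = 0 < ink <= max_ink
        if thin and run_start is None:
            run_start = y
        elif not thin and run_start is not None:
            # Use the middle row of the thin run as the cut position.
            cuts.append((run_start + y - 1) // 2)
            run_start = None
    return cuts
```

In a multi-font setting, a threshold such as `max_ink` would plausibly need a different value for each font group, which is in the spirit of the abstract's remark that segmentation parameters are adjusted per group.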


Pages: 93-106

Page count: 14

50 items in total

[31] DISPLAY ACCELERATION METHODS FOR DRAWING MULTI-FONT DOCUMENTS IN INTERACTIVE EDITING
YOSHIDA, S
NOMURA, T
MAMEDA, K
IAI, F
KUBO, N
SHARP TECHNICAL JOURNAL, 1991, (48): 31 - 36
[32] Script Identification of Pre-Segmented Multi-Font Characters and Digits
Rani, Rajneesh
Dhir, Renu
Lehal, Gurpreet Singh
2013 12TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), 2013: 1150 - 1154
[33] Font and function word identification in document recognition
Khoubyari, S
Hull, JJ
COMPUTER VISION AND IMAGE UNDERSTANDING, 1996, 63 (01) : 66 - 74
[34] ON THE RECOGNITION OF PRINTED CHARACTERS OF ANY FONT AND SIZE
KAHAN, S
PAVLIDIS, T
BAIRD, HS
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1987, 9 (02) : 274 - 288
[35] ICDAR2013 Competition on Multi-font and Multi-size Digitally Represented Arabic Text
Slimane, Fouad
Kanoun, Slim
El Abed, Haikal
Alimi, Adel M.
Ingold, Rolf
Hennebert, Jean
2013 12TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), 2013: 1433 - 1437
[36] A new hybrid method for Arabic multi-font text segmentation, and a reference corpus construction
Zoizou, Abdelhay
Zarghili, Arsalane
Chaker, Ilham
JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2020, 32 (05) : 576 - 582
[37] ICDAR2017 Competition on Multi-font and Multi-size Digitally Represented Arabic Text
Slimane, Fouad
Ingold, Rolf
Hennebert, Jean
2017 14TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), VOL 1, 2017: 1466 - 1472
[38] FONT FINDER: VISUAL RECOGNITION OF TYPEFACE IN PRINTED DOCUMENTS
Bui, Tu
Collomosse, John
2015 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2015: 3926 - 3930
[39] Classical Mongolian Words Recognition in Historical Document
Gao, Guanglai
Su, Xiangdong
Wei, Hongxi
Gong, Yeyun
11TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR 2011), 2011: 692 - 697
[40] Optical english font recognition in document images using eigenfaces
Al-Khaffaf, Hasan S. M.
Musa, Nadia A.
REVISTA INNOVACIENCIA, 2018, 6 (01):
