Word-Level Thirteen Official Indic Languages Database for Script Identification in Multi-script Documents

Cited by: 0
Authors
Obaidullah, Sk Md [1]
Santosh, K. C. [2]
Halder, Chayan [3]
Das, Nibaran [4]
Roy, Kaushik [3]
Affiliations
[1] Aliah Univ Kolkata, Dept Comp Sci & Engn, Kolkata, W Bengal, India
[2] Univ South Dakota, Dept Comp Sci, Vermillion, SD 57069 USA
[3] Jadavpur Univ, Dept Comp Sci & Engn, Kolkata, India
[4] West Bengal State Univ, Dept Comp Sci, Kolkata, India
Keywords
Multi-script documents; Official Indic script database; Script identification
DOI
10.1007/978-981-10-4859-3_2
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Without a publicly available database, research cannot advance, and methods cannot be compared fairly with the state of the art. To bridge this gap, we present a database of eleven Indic scripts from thirteen official languages for script identification in multi-script document images. The database comprises 39K words distributed equally (3K words per language). We also study three features: spatial energy (SE), wavelet energy (WE), and the Radon transform (RT), along with their combinations, using three classifiers: multilayer perceptron (MLP), fuzzy unordered rule induction algorithm (FURIA), and random forest (RF). In our tests, using all features, MLP performs best, achieving bi-script accuracies of 99.24% (keeping Roman common) and 98.38% (keeping Devanagari common), and a tri-script accuracy of 98.19% (keeping both Devanagari and Roman common).
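As a rough illustration of what a wavelet energy (WE) feature can look like, the sketch below computes one-level Haar sub-band energies for a small grayscale word image. The abstract does not state which wavelet, decomposition depth, or energy definition the authors used, so the Haar filter, the single level, the mean-of-squares energy, and the names `haar_subbands` and `wavelet_energy` are all assumptions made for illustration only.

```python
from typing import Dict, List

Image = List[List[float]]  # grayscale word image as rows of pixel values


def haar_subbands(img: Image) -> Dict[str, List[float]]:
    """One-level orthonormal 2D Haar decomposition (assumed wavelet, not the paper's).

    Each non-overlapping 2x2 block contributes one coefficient to each of the
    four sub-bands. Height and width must be even.
    """
    rows, cols = len(img), len(img[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    bands: Dict[str, List[float]] = {
        "approx": [], "row_diff": [], "col_diff": [], "diag": []
    }
    for r in range(0, rows, 2):
        for c in range(0, cols, 2):
            a, b = img[r][c], img[r][c + 1]          # top row of the block
            d, e = img[r + 1][c], img[r + 1][c + 1]  # bottom row of the block
            bands["approx"].append((a + b + d + e) / 2.0)    # local average
            bands["row_diff"].append((a + b - d - e) / 2.0)  # top row minus bottom row
            bands["col_diff"].append((a - b + d - e) / 2.0)  # left column minus right column
            bands["diag"].append((a - b - d + e) / 2.0)      # diagonal difference
    return bands


def wavelet_energy(img: Image) -> Dict[str, float]:
    """Mean squared coefficient per sub-band (an assumed energy definition)."""
    return {
        name: sum(v * v for v in coeffs) / len(coeffs)
        for name, coeffs in haar_subbands(img).items()
    }


# Hypothetical 4x4 "word image"; real word images would be far larger.
word = [
    [0, 255, 0, 255],
    [0, 255, 0, 255],
    [255, 0, 255, 0],
    [255, 0, 255, 0],
]
features = wavelet_energy(word)  # four numbers usable as a feature vector
```

A feature vector of this kind could then be concatenated with other descriptors and passed to any standard classifier; how the authors actually combined SE, WE, and RT features is not described in the abstract.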
Pages: 16-27
Number of pages: 12
Related papers
50 in total
  • [21] Word-Level Script Identification from Scene Images
    Fasil, O. K.
    Manjunath, S.
    Aradhya, V. N. Manjunath
    PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON FRONTIERS IN INTELLIGENT COMPUTING: THEORY AND APPLICATIONS, (FICTA 2016), VOL 2, 2017, 516 : 417 - 426
  • [22] Improved Shape Code Based Word Matching For Multi-script Documents
    Mondal, Tanmoy
    Tarafdar, Arundhati
    Ragot, Nicolas
    Ramel, Jean-Yves
    Pal, Umapada
    PROCEEDINGS 3RD IAPR ASIAN CONFERENCE ON PATTERN RECOGNITION ACPR 2015, 2015, : 181 - 185
  • [23] MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification
    Ferrer, Miguel A.
    Das, Abhijit
    Diaz, Moises
    Morales, Aythami
    Carmona-Duarte, Cristina
    Pal, Umapada
    arXiv
  • [24] MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification
    Ferrer, Miguel A.
    Das, Abhijit
    Diaz, Moises
    Morales, Aythami
    Carmona-Duarte, Cristina
    Pal, Umapada
    COGNITIVE COMPUTATION, 2024, 16 (01) : 131 - 157
  • [25] MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification
    Miguel A. Ferrer
    Abhijit Das
    Moises Diaz
    Aythami Morales
    Cristina Carmona-Duarte
    Umapada Pal
    Cognitive Computation, 2024, 16 (1) : 131 - 157
  • [26] WORD-LEVEL RECOGNITION OF CURSIVE SCRIPT
    FARAG, RFH
    IEEE TRANSACTIONS ON COMPUTERS, 1979, 28 (02) : 172 - 175
  • [27] Word-Level Script Identification Using Texture Based Features
    Singh, Pawan Kumar
    Sarkar, Ram
    Nasipuri, Mita
    INTERNATIONAL JOURNAL OF SYSTEM DYNAMICS APPLICATIONS, 2015, 4 (02) : 74 - 94
  • [28] Multi-script bibliographic database: an Indian perspective
    Chandrakar, R
    ONLINE INFORMATION REVIEW, 2002, 26 (04) : 246 - 251
  • [29] Word Level Multi-Script Identification Using Curvelet Transform in Log-Polar Domain
    Sahare, Parul
    Chaudhari, Ravindra E.
    Dhok, Sanjay B.
    IETE JOURNAL OF RESEARCH, 2019, 65 (03) : 410 - 432
  • [30] Identifying script on word-level with informational confidence
    Jaeger, S
    Ma, HF
    Doermann, D
    EIGHTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS 1 AND 2, PROCEEDINGS, 2005, : 416 - 420