Word-Level Thirteen Official Indic Languages Database for Script Identification in Multi-script Documents

Cited by: 0
Authors
Obaidullah, Sk Md [1]
Santosh, K. C. [2]
Halder, Chayan [3]
Das, Nibaran [4]
Roy, Kaushik [3]
Affiliations
[1] Aliah Univ Kolkata, Dept Comp Sci & Engn, Kolkata, W Bengal, India
[2] Univ South Dakota, Dept Comp Sci, Vermillion, SD 57069 USA
[3] Jadavpur Univ, Dept Comp Sci & Engn, Kolkata, India
[4] West Bengal State Univ, Dept Comp Sci, Kolkata, India
Keywords
Multi-script documents; Official Indic script database; Script identification
DOI
10.1007/978-981-10-4859-3_2
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Without a publicly available database, we can neither advance research nor make a fair comparison with state-of-the-art methods. To bridge this gap, we present a database of eleven Indic scripts from thirteen official languages for script identification in multi-script document images. The database comprises 39K words, equally distributed across languages (i.e., 3K words per language). We also study three pertinent features, spatial energy (SE), wavelet energy (WE) and the Radon transform (RT), along with their possible combinations, using three classifiers: multilayer perceptron (MLP), fuzzy unordered rule induction algorithm (FURIA) and random forest (RF). In our tests, using all features, MLP is the best performer, with bi-script accuracies of 99.24% (keeping Roman common) and 98.38% (keeping Devanagari common), and a tri-script accuracy of 98.19% (keeping both Devanagari and Roman common).
Pages: 16 - 27
Page count: 12
Related papers
50 in total
  • [11] Page-level Script Identification from Multi-script Handwritten Documents
    Singh, Pawan Kumar
    Dalal, Santu Kumar
    Sarkar, Ram
    Nasipuri, Mita
    2015 THIRD INTERNATIONAL CONFERENCE ON COMPUTER, COMMUNICATION, CONTROL AND INFORMATION TECHNOLOGY (C3IT), 2015
  • [13] Automatic Indic script identification from handwritten documents: page, block, line and word-level approach
    Obaidullah, Sk Md
    Santosh, K. C.
    Halder, Chayan
    Das, Nibaran
    Roy, Kaushik
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2019, 10 (01) : 87 - 106
  • [14] Identification of different script lines from multi-script documents
    Pal, U
    Chaudhuri, BB
    IMAGE AND VISION COMPUTING, 2002, 20 (13-14) : 945 - 954
  • [15] HVS inspired system for script identification in Indian multi-script documents
    Pati, PB
    Ramakrishnan, AG
    DOCUMENT ANALYSIS SYSTEMS VII, PROCEEDINGS, 2006, 3872 : 380 - 389
  • [16] Statistical comparison of classifiers for script identification from multi-script handwritten documents
    Singh, Pawan Kumar
    Sarkar, Ram
    Das, Nibaran
    Basu, Subhadip
    Nasipuri, Mita
    INTERNATIONAL JOURNAL OF APPLIED PATTERN RECOGNITION, 2014, 1 (02) : 152 - 172
  • [17] Multi-script line identification from Indian documents
    Pal, U
    Sinha, S
    Chaudhuri, BB
    SEVENTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2003 : 880 - 884
  • [18] Improved word-level handwritten Indic script identification by integrating small convolutional neural networks
    Ukil, Soumya
    Ghosh, Swarnendu
    Obaidullah, Sk Md
    Santosh, K. C.
    Roy, Kaushik
    Das, Nibaran
    NEURAL COMPUTING & APPLICATIONS, 2020, 32 (07) : 2829 - 2844
  • [20] Script line separation from Indian multi-script documents
    Pal, U
    Chaudhuri, BB
    IETE JOURNAL OF RESEARCH, 2003, 49 (01) : 3 - 11