Newspaper text recognition in Bengali script using support vector machine

Citations: 0
Author
Ghosh, Rajib [1]
Affiliation
[1] Natl Inst Technol Patna, Dept Comp Sci & Engn, Ashok Rajpath, Patna 800005, India
Keywords
Newspaper text recognition; Bengali script; Support vector machine; HANDWRITTEN WORD RECOGNITION; OCR SYSTEM; DEVANAGARI;
DOI
10.1007/s11042-023-16862-0
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Newspapers contain a huge amount of important information on current affairs as well as notable past events. Browsing the digital versions of newspaper documents becomes much easier if the documents are indexed or transcribed. Automatic transcription requires computer-based systems for automatic recognition of newspaper text. However, no such recognition system exists for Bengali script, the second most popular Indian script. This article proposes a newspaper text recognition system for Bengali script for the first time in the literature. Initially, each newspaper article is segmented into image and text portions. The text portion is then segmented into text lines, each text line into words, and each word into characters. Discriminating features are then extracted from each character using different feature extraction techniques. The feature vector of each character is then fed to a support vector machine (SVM) classifier to recognize each character of the newspaper document image. The performance of the proposed system has been evaluated on a self-generated dataset, on which it achieves a text recognition accuracy of 97.78%.
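The abstract does not name the segmentation method, so the following is only a minimal sketch of one common approach, projection-profile segmentation, and is not the author's implementation. It assumes a binarized page given as a list of rows with 1 for ink and 0 for background; the function names `ink_runs`, `segment_lines`, and `segment_columns` are hypothetical. Splitting on blank columns separates blocks that have whitespace between them; separating characters joined within a Bengali word would need further steps that the abstract does not describe.

```python
def ink_runs(profile):
    """Return (start, end) index pairs for each maximal run of non-zero values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins here
        elif not value and start is not None:
            runs.append((start, i))        # the run ended just before i
            start = None
    if start is not None:                  # the run reaches the last index
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """Split a binary image into text lines at blank rows (assumed method)."""
    row_profile = [sum(row) for row in image]
    return [image[top:bottom] for top, bottom in ink_runs(row_profile)]


def segment_columns(block):
    """Split a text-line image into pieces at blank columns (assumed method)."""
    col_profile = [sum(col) for col in zip(*block)]
    return [[row[left:right] for row in block]
            for left, right in ink_runs(col_profile)]


# Toy page: two text lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
lines = segment_lines(page)
pieces = [segment_columns(line) for line in lines]
```

In a full pipeline of the kind the abstract outlines, each resulting piece would be turned into a feature vector and passed to a classifier such as an SVM; those stages are omitted here because the abstract does not specify the features or kernel.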
Pages: 32973-32991
Page count: 19