An efficient character recognition method using enhanced HOG for spam image detection

Cited by: 27
Authors
Naiemi, Fatemeh [1]
Ghods, Vahid [1]
Khalesi, Hassan [2]
Affiliations
[1] Islamic Azad Univ, Semnan Branch, Dept Elect & Comp Engn, Semnan, Iran
[2] Islamic Azad Univ, Garmsar Branch, Dept Elect & Comp Engn, Garmsar, Iran
Keywords
Spam detection; OCR; Histogram of oriented gradients; Enhanced HOG; SVM; Social media; Security; ROC curve; TESTS;
DOI
10.1007/s00500-018-03728-z
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Generally, a spam image is an unsolicited message sent electronically to a large group of arbitrary addresses. Because they are attractive and harder to detect, spam images are the most complicated type of spam. One way to counter spam images is optical character recognition (OCR). In this paper, an enhanced HOG feature extraction method is proposed to improve the OCR system for spam so that it is both robust to variations in character scale and translation and computationally cost-effective. To this end, two steps, image cropping and input image size normalization, are added to the pre-processing stage. A support vector machine (SVM) is used for classification. Two heuristic modifications are also proposed to increase recognition accuracy: thickening thin characters in the pre-processing stage, and not discriminating between uppercase and lowercase letters with the same shape in the classification stage. In the first modification, when all pixels of the output image are empty (the character has been eliminated), the original image is thickened by one layer. In the second, uppercase and lowercase letters with the same shape are not differentiated during recognition. The modified HOG method with the two heuristic modifications achieves an average recognition accuracy of 91.61% on the Char74K database. An optimum classification threshold was then investigated using the ROC curve. The optimal cutoff point was 0.736, giving the highest average accuracy, 94.20%; the area under the curve (AUC) for the ROC and precision-recall (PR) curves was 0.96 and 0.73, respectively. The proposed method was also evaluated on the ICDAR2003 database, where the average accuracy and its optimum obtained using the ROC curve were 82.73% and 86.01%, respectively. These recognition accuracy and AUC results show a marked improvement over the best recognition rates of previous methods.
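The pre-processing steps and the two heuristics described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the nearest-neighbour resize, the 4-neighbour thickening, the 32-pixel target size, all function names, and the set of same-shape letters are assumptions, since the abstract does not specify them. HOG extraction and SVM training are omitted.

```python
import numpy as np

# Hypothetical set of letters whose upper and lower case share a shape;
# the abstract does not list which letters the authors merged.
SAME_SHAPE = set("cosvwxz")

def crop_to_character(img):
    """Crop a binary image (nonzero = ink) to the bounding box of its ink."""
    rows = np.flatnonzero(img.any(axis=1))
    cols = np.flatnonzero(img.any(axis=0))
    if rows.size == 0:  # blank image: nothing to crop
        return img
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def resize_nearest(img, size):
    """Resize to size x size by nearest-neighbour sampling (assumed method)."""
    h, w = img.shape
    r = np.arange(size) * h // size
    c = np.arange(size) * w // size
    return img[np.ix_(r, c)]

def thicken(img):
    """Grow the ink by one pixel in the four axis directions."""
    p = np.pad(img.astype(bool), 1)
    grown = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
             | p[1:-1, :-2] | p[1:-1, 2:])
    return grown.astype(img.dtype)

def normalize_character(img, size=32):
    """Crop and size-normalize; if the character vanishes, thicken and retry."""
    out = resize_nearest(crop_to_character(img), size)
    if not out.any() and img.any():
        # First heuristic: the output is empty, so thicken the original
        # image by one layer and normalize again.
        out = resize_nearest(crop_to_character(thicken(img)), size)
    return out

def same_shape_class(label):
    """Second heuristic: fold case only for letters with the same shape."""
    return label.lower() if label.lower() in SAME_SHAPE else label
```

Cropping before resizing is what gives translation and scale tolerance in this sketch: the character fills the frame regardless of where it sat or how large it was in the input.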
Pages: 11759-11774
Page count: 16