Pattern Based Comprehensive Urdu Stemmer and Short Text Classification

Cited by: 19
Authors
Ali, Mubashir [1]
Khalid, Shehzad [1]
Aslam, Muhammad Haseeb [1]
Affiliations
[1] Bahria Univ, Dept Comp Engn, Islamabad, Pakistan
Keywords
Infix classes; infix rules; stemming rules; stemming lists; Urdu stemmer; short text classification
DOI
10.1109/ACCESS.2017.2787798
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Urdu is used by approximately 200 million people for spoken and written communication. A large volume of unstructured Urdu text is available, and data mining techniques can extract useful information from this large, potentially informative data source. Many text processing systems exist for unstructured textual data, but most are language specific, and a large proportion apply only to English text. This is primarily due to language-dependent preprocessing, mainly the stemming requirement. Stemming is a vital preprocessing step in text mining; its primary aim is to reduce the grammatical forms of a word (variations in part of speech, gender, tense, and so on) to their root form. In this work, we develop a comprehensive rule-based stemming method for Urdu text. The proposed Urdu stemmer generates the stems of Urdu words as well as loan words borrowed from languages such as Arabic, Persian, and Turkish by removing prefixes, infixes, and suffixes. We introduce six novel Urdu infix word classes and a minimum word length rule to generate the stems of Urdu text. To cope with the challenge of Urdu infix stemming, we develop infix stripping rules for the introduced infix word classes and generic stemming rules for prefix and suffix stemming. We also present a probabilistic classification approach for Urdu short text. Experiments demonstrate the effectiveness and efficacy of the proposed approach, and we compare it with existing state-of-the-art approaches. The stemming accuracy results demonstrate the adoptability of the proposed stemming approach for a variety of text processing applications.
Pages: 7374-7389
Page count: 16