Text representation and classification based on bi-gram alphabet

Cited by: 12
Authors
Elghannam, Fatma [1 ]
Affiliation
[1] Elect Res Inst, Cairo, Egypt
Keywords
Text representation; Document classification; Feature extraction; Arabic document; Bi-gram alphabet; Support vector machine;
DOI
10.1016/j.jksuci.2019.01.005
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
In text classification, texts have to be transformed into numeric representations suitable for the learning algorithms. A main problem with the commonly used bag of words method is the high dimensionality of the vector space, as well as the need for language-dependent tools. In the present study, text classification is performed based on a novel bi-gram alphabet approach to construct feature terms. The proposed approach makes two main contributions to the text classification area. First, we have demonstrated the possibility of using constant feature terms that are based on the standard alphabet without the need for the documents' vocabularies; this helps reduce the dimensions of the vector space for large corpora. Second, it does not require natural language processing tools. The current work has shown that collections of Arabic or English text documents can be classified successfully. It showed approximately 80% savings in vector space and a 2% performance improvement compared to the best recorded results on the Arabic dataset Aljazeera News. © 2019 The Author. Production and hosting by Elsevier B.V. on behalf of King Saud University.
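The idea of constant, vocabulary-free feature terms can be illustrated with a minimal sketch. The abstract does not give the exact construction, so the following rests on assumptions: the feature terms are taken to be all ordered pairs of letters from a standard alphabet (here the 26 lowercase English letters, giving 676 terms), and each document is mapped to the frequencies of those pairs. The names `ALPHABET`, `FEATURES`, `INDEX`, and `bigram_vector` are hypothetical and not taken from the paper.

```python
from itertools import product
import string

# Assumption: the fixed feature terms are all ordered letter pairs from a
# standard alphabet. The 26 lowercase English letters are used here; a
# different alphabet string (e.g. Arabic letters) could be substituted.
ALPHABET = string.ascii_lowercase
FEATURES = ["".join(pair) for pair in product(ALPHABET, repeat=2)]
INDEX = {term: i for i, term in enumerate(FEATURES)}


def bigram_vector(text, normalize=True):
    """Map a text to a fixed-length vector of letter bi-gram frequencies."""
    counts = [0.0] * len(FEATURES)
    lowered = text.lower()
    # Slide over adjacent character pairs; pairs containing any character
    # outside the alphabet (spaces, digits, punctuation) are skipped.
    for a, b in zip(lowered, lowered[1:]):
        i = INDEX.get(a + b)
        if i is not None:
            counts[i] += 1
    total = sum(counts)
    if normalize and total > 0:
        counts = [c / total for c in counts]
    return counts


vec = bigram_vector("Text classification", normalize=False)
print(len(vec))           # 676, regardless of corpus vocabulary
print(vec[INDEX["te"]])   # 1.0
```

The vector length depends only on the alphabet size, not on the corpus, which is the property the abstract credits for the savings in vector space. Vectors of this kind could then be passed to a classifier such as the support vector machine named in the keywords; that step is not shown here.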
Pages: 235-242
Page count: 8
Related papers
38 in total
[1] Aggarwal C. C, 2012, MINING TEXT DATA, DOI 10.1007/978-1-4614-3223-4
[2] Al-Shalabi R., 2008, P 6 INT C INFORMATIC, P108
[3] Al-Tahrawi, Mayy M.; Al-Khatib, Sumaya N. Arabic text classification using Polynomial Networks [J]. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2015, 27 (04): 437-449
[4] Al-Thwaib E., 2014, World of Computer Science and Information Technology Journal (WCSIT), V4, P101
[5] Anitha N., 2013, INT J INNOVAT ENG TE, V3, P22
[6] [Anonymous], 1998, Austrian Res. Inst. Artif. Intell.
[7] [Anonymous], 2004, MOUR ABB
[8] [Anonymous], 2004, ALJ NEWS
[9] [Anonymous], Functional Arabic Morphology. Formal System and Implementation
[10] Bahassine S, 2017, J ENG SCI TECHNOL, V12, P1475