A lexicon based approach for classifying Arabic multi-labeled text

Cited by: 14
Authors
Hmeidi, Ismail [1]
Al-Ayyoub, Mahmoud [1]
Mahyoub, Nizar A. [1]
Shehab, Mohammed A. [1]
Affiliations
[1] Jordan University of Science and Technology, Irbid, Jordan
Keywords
Label-set dimensionality; Lexicon-based multi-label classification; ML-Accuracy; Multi-label data; Single-label data
DOI
10.1108/IJWIS-01-2016-0002
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Purpose - Multi-label text classification (MTC) is one of the most recent research trends in the data mining and information retrieval domains, for reasons such as the rapid growth of online data and the increasing tendency of internet users to assign multiple labels/tags to describe documents, emails, posts, etc. The dimensionality of the label set makes MTC more difficult and challenging than traditional single-label text classification (TC). Because MTC is a natural extension of TC, several ways have been proposed to benefit from the rich TC literature through what are called problem transformation (PT) methods. PT methods transform multi-label data into single-label data that is suitable for traditional single-label classification algorithms. Another approach is to design novel classification algorithms customized for MTC. Over the past decade, several works have appeared on both approaches, focusing mainly on the English language. This work aims to present an elaborate study of MTC of Arabic articles.

Design/methodology/approach - This paper presents a novel lexicon-based method for MTC, in which the keywords most associated with each label are extracted from the training data, along with a threshold that is later used to determine whether a test document belongs to that label.

Findings - The experiments show that the presented approach outperforms the currently available approaches. Specifically, the best accuracy obtained from existing approaches is only 18 per cent, whereas the presented lexicon-based approach can reach an accuracy of 31 per cent.

Originality/value - Although some existing tools can be customized to address the MTC problem for Arabic text, their accuracies are very low when applied to Arabic articles. This paper presents a novel method for MTC.
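
The abstract describes the lexicon-based method only at a high level: per-label keywords extracted from training data, plus a per-label threshold applied to test documents. The Python sketch below is one possible reading of that idea, not the authors' implementation. The association score (the share of a word's occurrences that fall in documents carrying the label), the document score (the fraction of tokens found in the label's lexicon), the midpoint threshold rule, and all function names are assumptions made for illustration. Toy English tokens stand in for preprocessed Arabic text.

```python
from collections import Counter, defaultdict


def build_lexicons(train_docs, top_k=2):
    """train_docs: list of (tokens, labels) pairs. Returns {label: set of keywords}."""
    overall = Counter()
    per_label = defaultdict(Counter)
    for tokens, labels in train_docs:
        overall.update(tokens)
        for label in labels:
            per_label[label].update(tokens)
    lexicons = {}
    for label, counts in per_label.items():
        # Assumed association score: share of a word's occurrences that fall in
        # documents carrying this label; ties broken by count, then alphabetically.
        ranked = sorted(counts, key=lambda w: (-counts[w] / overall[w], -counts[w], w))
        lexicons[label] = set(ranked[:top_k])
    return lexicons


def score(tokens, lexicon):
    """Fraction of the document's tokens that appear in the label lexicon."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon) / len(tokens)


def fit_thresholds(train_docs, lexicons):
    """Assumed rule: midpoint between mean positive and mean negative scores."""
    thresholds = {}
    for label, lexicon in lexicons.items():
        pos = [score(t, lexicon) for t, ls in train_docs if label in ls]
        neg = [score(t, lexicon) for t, ls in train_docs if label not in ls]
        pos_mean = sum(pos) / len(pos)
        neg_mean = sum(neg) / len(neg) if neg else 0.0
        thresholds[label] = (pos_mean + neg_mean) / 2
    return thresholds


def predict(tokens, lexicons, thresholds):
    """Assign every label whose lexicon score reaches that label's threshold."""
    return {lab for lab, lex in lexicons.items() if score(tokens, lex) >= thresholds[lab]}


# Hypothetical toy training set: each document may carry several labels.
train = [
    ("match goal team win".split(), {"sports"}),
    ("team goal league".split(), {"sports"}),
    ("market stock bank".split(), {"economy"}),
    ("bank market trade".split(), {"economy"}),
    ("team market sponsor goal bank".split(), {"sports", "economy"}),
]

lexicons = build_lexicons(train, top_k=2)
thresholds = fit_thresholds(train, lexicons)
print(predict("team goal bank market".split(), lexicons, thresholds))
```

Because each label is decided independently against its own threshold, a document can receive zero, one, or several labels, which is what distinguishes this setup from single-label classification.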
Pages: 504-532
Page count: 29