Multi-label emotion classification of Urdu tweets

Cited by: 21
Authors
Ashraf, Noman [1]
Khan, Lal [2]
Butt, Sabur [1]
Chang, Hsien-Tsung [2, 3, 4]
Sidorov, Grigori [1]
Gelbukh, Alexander [1]
Affiliations
[1] Inst Politecn Nacl, CIC, Mexico City, DF, Mexico
[2] Chang Gung Univ, Dept Comp Sci & Informat Engn, Taoyuan, Taiwan
[3] Chang Gung Univ, Artificial Intelligence Res Ctr, Taoyuan, Taiwan
[4] Chang Gung Mem Hosp, Dept Phys Med & Rehabil, Taoyuan, Taiwan
Keywords
Emotion detection; Emotion classification in Urdu; Multi-label emotion detection; Machine learning; Deep learning; Natural language processing; SENTIMENT; MODEL
DOI
10.7717/peerj-cs.896
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Urdu is a widely used language in South Asia and worldwide. While there are similar datasets available in English, we created the first multi-label emotion dataset consisting of 6,043 tweets and six basic emotions in the Urdu Nastaliq script. A multi-label (ML) classification approach was adopted to detect emotions from Urdu. The morphological and syntactic structure of Urdu makes it a challenging problem for multi-label emotion detection. In this paper, we build a set of baseline classifiers such as machine learning algorithms (Random forest (RF), Decision tree (J48), Sequential minimal optimization (SMO), AdaBoostM1, and Bagging), deep-learning algorithms (Convolutional Neural Networks (1D-CNN), Long short-term memory (LSTM), and LSTM with CNN features) and a transformer-based baseline (BERT). We used a combination of text representations: stylometric-based features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper highlights the annotation guidelines, dataset characteristics and insights into different methodologies used for Urdu-based emotion classification. We present our best results using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL) and exact match (EM) for all tested methods.
Pages: 25