Multi-label emotion classification of Urdu tweets

Cited by: 21

Authors:

Ashraf, Noman [1]

Khan, Lal [2]

Butt, Sabur [1]

Chang, Hsien-Tsung [2,3,4]

Sidorov, Grigori [1]

Gelbukh, Alexander [1]

Affiliations:

[1] Inst Politecn Nacl, CIC, Mexico City, DF, Mexico

[2] Chang Gung Univ, Dept Comp Sci & Informat Engn, Taoyuan, Taiwan

[3] Chang Gung Univ, Artificial Intelligence Res Ctr, Taoyuan, Taiwan

[4] Chang Gung Mem Hosp, Dept Phys Med & Rehabil, Taoyuan, Taiwan

Source:

PEERJ COMPUTER SCIENCE | 2022 / Volume 8

Keywords:

Emotion detection; Emotion classification in Urdu; Multi-label emotion detection; Machine learning; Deep learning; Natural language processing; SENTIMENT; MODEL;

DOI:

10.7717/peerj-cs.896

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Urdu is a widely used language in South Asia and worldwide. While similar datasets are available in English, we created the first multi-label emotion dataset in the Urdu Nastaliq script, consisting of 6,043 tweets annotated with six basic emotions. A multi-label (ML) classification approach was adopted to detect emotions in Urdu text. The morphological and syntactic structure of Urdu makes multi-label emotion detection a challenging problem. In this paper, we build a set of baseline classifiers comprising machine learning algorithms (Random forest (RF), Decision tree (J48), Sequential minimal optimization (SMO), AdaBoostM1, and Bagging), deep-learning algorithms (Convolutional Neural Networks (1D-CNN), Long short-term memory (LSTM), and LSTM with CNN features), and a transformer-based baseline (BERT). We used a combination of text representations: stylometric-based features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper highlights the annotation guidelines, dataset characteristics, and insights into the different methodologies used for Urdu-based emotion classification. We present our best results using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL), and exact match (EM) for all tested methods.
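The evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names and the toy label matrices below are assumptions made purely for illustration, and the standard textbook definitions are used (the record does not state the exact definitions the paper adopts, in particular for multi-label accuracy, which is therefore omitted). Each row is one tweet and each column is one emotion, with 1 meaning the emotion is present.

```python
from typing import List

Labels = List[List[int]]  # rows: tweets, columns: 0/1 flag per emotion


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    total = sum(len(row) for row in y_true)
    return wrong / total


def exact_match(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of tweets whose whole label set is predicted correctly."""
    return sum(rt == rp for rt, rp in zip(y_true, y_pred)) / len(y_true)


def _counts(y_true: Labels, y_pred: Labels, j: int):
    """True positives, false positives, false negatives for label column j."""
    tp = sum(1 for rt, rp in zip(y_true, y_pred) if rt[j] == 1 and rp[j] == 1)
    fp = sum(1 for rt, rp in zip(y_true, y_pred) if rt[j] == 0 and rp[j] == 1)
    fn = sum(1 for rt, rp in zip(y_true, y_pred) if rt[j] == 1 and rp[j] == 0)
    return tp, fp, fn


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true: Labels, y_pred: Labels) -> float:
    """F1 computed from counts pooled over all labels."""
    n_labels = len(y_true[0])
    per_label = [_counts(y_true, y_pred, j) for j in range(n_labels)]
    tp, fp, fn = (sum(c[i] for c in per_label) for i in range(3))
    return _f1(tp, fp, fn)


def macro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    scores = [_f1(*_counts(y_true, y_pred, j)) for j in range(n_labels)]
    return sum(scores) / n_labels


# Toy data (invented): 4 tweets, 3 emotion labels.
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Exact match:", exact_match(y_true, y_pred))
print("Micro F1:", micro_f1(y_true, y_pred))
print("Macro F1:", macro_f1(y_true, y_pred))
```

Micro-averaging pools counts over all labels, so frequent emotions dominate the score, whereas macro-averaging weights every emotion equally. Exact match is the strictest measure because a single wrong label makes the whole tweet count as an error.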

References

Pages: 25

79 entries in total
