DEVELOPMENT OF COMPUTATIONAL LINGUISTIC RESOURCES FOR AUTOMATED DETECTION OF TEXTUAL CYBERBULLYING THREATS IN ROMAN URDU LANGUAGE

Cited by: 17

Authors:

Dewani, Amirita [1]

Memon, Mohsin Ali [1]

Bhatti, Sania [1]

Affiliations:

[1] Mehran Univ Engn & Technol, Jamshoro, Sindh, Pakistan

Source:

3C TIC | 2021 / Vol. 10 / No. 02

Keywords:

Linguistic Resources; Cyberaggression; Cyberbullying; Hate Speech Detection; Abusive Language Automated Detection;

DOI:

10.17993/3ctic.2021.102.101-121

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Automatic cyberbullying detection remains a very challenging task because social media content and conversations are usually posted as unstructured free text that disregards language norms. A major gap in formulating cyberbullying detection strategies is the scarcity of linguistic resources, particularly for newly evolved languages. Roman Urdu has emerged only recently and is therefore a resource-poor language. Urdu is widely known as the national language of Pakistan; however, because of socio-cultural and multilingual factors, Roman Urdu is widely used on the Internet by Asians, and more specifically by Pakistanis. To fill this gap, this research presents guidelines for the data annotation process and develops two linguistic resources. (i) An annotated corpus in Roman Urdu for cyberaggression and offensive language detection. Annotation was performed by bilingual annotators rather than through crowdsourcing, which has the benefit of correctly annotating instances that constitute clear cases of cyberbullying without compromising data quality. The resulting corpus is highly balanced (with almost negligible skew), unlike most existing corpora, even those in mature languages. (ii) A domain-specific stop-word list for Roman Urdu. Processing textual information for NLP tasks involves stop-word elimination as a sub-phase, because stop words carry the least semantic information of the tokens and index terms in a corpus and enlarge the feature space. The list takes all lexical variants into account and was developed specifically in the context of aggression detection and the collected data. The work was carried out using the Python programming language and the PyCharm IDE.
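The stop-word elimination step described in the abstract can be illustrated with a minimal Python sketch. It assumes a simple lowercase, letters-only tokenizer and a small hand-written variant table; the names `STOP_WORD_VARIANTS`, `tokenize`, and `remove_stop_words` are hypothetical, and the listed words are illustrative Roman Urdu function words, not the authors' published stop-word list.

```python
import re

# Hypothetical variant table: each canonical stop word maps to the spelling
# variants it may take in Roman Urdu text. Illustrative entries only; this is
# NOT the stop-word list developed in the paper.
STOP_WORD_VARIANTS = {
    "hai": {"hai", "hay", "hy"},
    "aur": {"aur", "or"},
    "ka": {"ka", "ki", "ke", "kay"},
}

# Flatten every variant into a single set for O(1) membership checks.
STOP_WORDS = {v for variants in STOP_WORD_VARIANTS.values() for v in variants}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed scheme)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(text):
    """Return the tokens of `text` with every stop-word variant dropped."""
    return [token for token in tokenize(text) if token not in STOP_WORDS]


if __name__ == "__main__":
    print(remove_stop_words("Yeh kitab meri hai aur wo teri hy"))
```

Listing the variants explicitly, rather than normalizing spellings first, keeps the sketch short; a real resource would need far broader coverage of spelling variation.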

Pages: 101-121

Number of pages: 21

