Offensive-Language Detection on Multi-Semantic Fusion Based on Data Augmentation

Cited by: 5
Authors
Liu, Junjie [1]
Yang, Yong [1]
Fan, Xiaochao [1]
Ren, Ge [1]
Yang, Liang [2]
Ning, Qian [3, 4]
Affiliations
[1] Xinjiang Normal Univ, Sch Comp Sci & Technol, Urumqi 830000, Peoples R China
[2] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian 116000, Peoples R China
[3] Xinjiang Normal Univ, Sch Phys & Elect Engn, Urumqi 830000, Peoples R China
[4] Sichuan Univ, Coll Elect & Informat Engn, Chengdu 610000, Peoples R China
Keywords
offensive language; data augmentation; MSF
DOI
10.3390/asi5010009
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The rapid identification of offensive language on social media is of great significance for curbing the viral spread of malicious information, such as cyberbullying and content related to self-harm. In existing research, the public datasets of offensive language are small, the label quality is uneven, and the performance of pre-trained models is unsatisfactory. To overcome these problems, we proposed a multi-semantic fusion model based on data augmentation (MSF). Data augmentation was carried out by back translation to reduce the impact of small datasets on performance. At the same time, we used a novel fusion mechanism that combines word-level semantic features with character n-gram features. The experimental results on two datasets showed that the proposed model effectively extracts the semantic information of offensive language and achieves state-of-the-art performance on both datasets.
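As a rough illustration of the character n-gram features mentioned in the abstract, the following minimal Python sketch extracts overlapping character n-grams from a string and counts them. The helper names `char_ngrams` and `ngram_counts` and the n-gram sizes (2 and 3) are assumptions made for illustration; this record does not describe how the authors compute or fuse these features.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text` (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, sizes=(2, 3)):
    """Count character n-grams of several sizes (the sizes are an assumption)."""
    counts = Counter()
    for n in sizes:
        counts.update(char_ngrams(text, n))
    return counts
```

Character-level features of this kind are often used alongside word-level representations because they remain informative for misspelled or deliberately obfuscated words.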
Pages: 12