HHSD: Hindi Hate Speech Detection Leveraging Multi-Task Learning

Cited by: 3
Authors
Kapil, Prashant [1]
Kumari, Gitanjali [1]
Ekbal, Asif [1]
Pal, Santanu [2]
Chatterjee, Arindam [1,2]
Vinutha, B. N. [2]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Patna 800013, India
[2] Wipro AI, Bengaluru 560035, India
Keywords
Earth; Annotations; Social networking (online); Hate speech; Tagging; Linguistics; Transformers; multi-task learning; F1 score; accuracy; Shared layers
DOI
10.1109/ACCESS.2023.3312993
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Hate speech is now a frequent occurrence on social media. Most recent research has focused on identifying hate speech in high-resource languages (e.g., English), while relatively little work addresses low-resource languages (e.g., Hindi, the third most widely used language on earth). In this study, the Hindi Hate Speech Dataset (HHSD) is created following a novel hierarchical, fine-grained, four-layer annotation approach. The top layer separates posts into hateful and non-hateful categories. The second layer further categorises hateful posts as explicitly or implicitly hateful. The third layer is a multilabel tagging of each post with topics such as political, religion, racism, or sexism. The fourth layer identifies the targeted named entity, whether targeted explicitly or implicitly. Additionally, a thorough evaluation of the data annotation schema for trustworthy annotation is provided. HHSD is the largest multi-layer annotated corpus in Hindi compared with existing multi-layer annotated data. Experiments on the dataset using transformer-based approaches in single-task learning (STL) attain encouraging performance in accuracy and weighted F1 score. The multi-task learning (MTL) experiments add multiple related hate speech detection tasks from high-resource English and from languages in the same linguistic family, such as Urdu and Bangla, with a transformer encoder as the shared layers. Over STL, MTL obtains significant improvements of 5.31% and 5.35% in accuracy and weighted F1 for layer A, and 8.20% and 22.83% for layer B. MTL also surpasses STL by 8.98% and 4.07% in exact match and Hamming loss for layer C.
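For readers unfamiliar with the two multilabel metrics named in the abstract, the following is a minimal sketch of how exact match and Hamming loss are commonly computed for a topic-tagging task. It is not the authors' evaluation code. The function names and the four toy posts are invented for illustration; only the topic names come from the abstract.

```python
from typing import List, Set

# Topic labels taken from the abstract; everything else here is a toy assumption.
LABELS = ["political", "religion", "racism", "sexism"]


def exact_match_ratio(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Fraction of posts whose predicted label set equals the gold set exactly."""
    matches = sum(1 for gold, pred in zip(y_true, y_pred) if gold == pred)
    return matches / len(y_true)


def hamming_loss(y_true: List[Set[str]], y_pred: List[Set[str]], labels: List[str]) -> float:
    """Fraction of individual (post, label) decisions that are wrong."""
    # The symmetric difference counts labels that are missing or spurious.
    wrong = sum(len(gold ^ pred) for gold, pred in zip(y_true, y_pred))
    return wrong / (len(y_true) * len(labels))


# Invented gold and predicted label sets for four hypothetical posts.
gold = [{"political"}, {"religion", "racism"}, {"sexism"}, {"political", "religion"}]
pred = [{"political"}, {"religion"}, {"sexism"}, {"political", "sexism"}]

print(exact_match_ratio(gold, pred))      # 0.5
print(hamming_loss(gold, pred, LABELS))   # 0.1875
```

Exact match is strict: a post counts as correct only if every topic is right, so higher is better. Hamming loss scores each topic decision separately, so lower is better and partially correct predictions are penalised less.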
Pages: 101460-101473
Page count: 14