Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi

Cited by: 11
Authors
Velankar, Abhishek [1 ,3 ]
Patil, Hrushikesh [1 ,3 ]
Joshi, Raviraj [2 ,3 ]
Affiliations
[1] Pune Inst Comp Technol, Pune, Maharashtra, India
[2] Indian Inst Technol Madras, Chennai, Tamil Nadu, India
[3] L3Cube, Pune, Maharashtra, India
Source
ARTIFICIAL NEURAL NETWORKS IN PATTERN RECOGNITION, ANNPR 2022 | 2023 / Vol. 13739
Keywords
Natural language processing; Text classification; Hate speech detection; Sentiment analysis; BERT; Marathi BERT;
DOI
10.1007/978-3-031-20650-4_10
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformers are the most prominent architectures used for a vast range of Natural Language Processing tasks. These models are pre-trained on a large text corpus and deliver state-of-the-art results on tasks like text classification. In this work, we conduct a comparative study between monolingual and multilingual BERT models. We focus on the Marathi language and evaluate the models on Marathi datasets for hate speech detection, sentiment analysis, and simple text classification. We use standard multilingual models such as mBERT, IndicBERT, and XLM-RoBERTa and compare them with MahaBERT, MahaALBERT, and MahaRoBERTa, the monolingual models for Marathi. We further show that Marathi monolingual models outperform the multilingual BERT variants in five different downstream fine-tuning experiments. We also evaluate sentence embeddings from these models by freezing the BERT encoder layers. We show that monolingual MahaBERT-based models provide richer representations than sentence embeddings from their multilingual counterparts. However, we observe that these embeddings are not generic enough and do not work well on out-of-domain social media datasets. We consider two Marathi hate speech datasets, L3Cube-MahaHate and HASOC-2021, a Marathi sentiment classification dataset, L3Cube-MahaSent, and the Marathi Headlines and Articles classification datasets.
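The frozen-encoder evaluation described above typically reduces per-token BERT outputs to a single sentence vector by masked mean pooling, which is then fed to a lightweight classifier. As a minimal sketch of just that pooling step (the encoder outputs are simulated here with a toy array; in practice they would come from a frozen model such as a MahaBERT checkpoint loaded via a library like HuggingFace Transformers):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Mean-pool token embeddings into one sentence vector,
    ignoring padded positions (mask == 0)."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # sum over real tokens
    counts = mask.sum(axis=1).clip(min=1e-9)         # number of real tokens
    return summed / counts

# Toy stand-in for frozen encoder output: batch of 2, seq len 4, hidden 3.
emb = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0],    # first sentence: 2 real tokens + padding
                 [1, 1, 1, 1]])   # second sentence: 4 real tokens
sent_vecs = mean_pool(emb, mask)
print(sent_vecs.shape)            # (2, 3): one vector per sentence
```

Because the encoder weights stay frozen, differences in downstream accuracy under this setup reflect the quality of the pre-trained representations themselves, which is what lets the paper compare monolingual and multilingual embeddings directly.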
Pages: 121-128
Page count: 8