Sinhala Hate Speech Detection in Social Media using Text Mining and Machine learning

Cited by: 9
Authors
Sandaruwan, H. M. S. T. [1]
Lorensuhewa, S. A. S. [1]
Kalyani, M. A. L. [1]
Affiliations
[1] Univ Ruhuna, Fac Sci, Dept Comp Sci, Matara, Sri Lanka
Source
2019 19TH INTERNATIONAL CONFERENCE ON ADVANCES IN ICT FOR EMERGING REGIONS (ICTER - 2019) | 2019
Keywords
Hate speech detection; Sinhala; machine learning; Natural language processing
DOI
10.1109/icter48817.2019.9023655
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
With the rapid growth of information technology and computer science, communicating and presenting ideologies has become easier than in earlier decades. Because social media are available globally through the web, anyone can easily target a person or a group belonging to a different culture or holding a different belief. Although everyone has the right to express his or her own ideas, that expression should not be harmful, as everyone also has the right to be protected from any kind of hate speech. Social media offer no automatic methods for detecting hate speech, so anyone can easily be targeted. Because social media service providers lack good linguistic knowledge of some languages, such as Sinhala, they may take a couple of days to remove hate-related comments once they have noticed them. Therefore, hate speech detection in the Sinhala language is urgent and important work. We propose lexicon-based and machine-learning-based approaches to automatically detect Sinhala hate and offensive speech shared through social media. In our study, the lexicon-based approach began with a lexicon generation process, and the corpus-based lexicon gave 76.3% accuracy for hate, offensive and neutral speech detection. The machine learning approach began with building a corpus of 3000 comments evenly distributed among hate, offensive and neutral speech. Using this comment corpus, we were able to identify the best-fitting feature groups and models for Sinhala hate speech detection. According to our experiments, character trigrams with Multinomial Naive Bayes gave the highest recall, 0.84, with 92.33% accuracy.
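To make the best-performing configuration in the abstract concrete, the following is a minimal sketch of a Multinomial Naive Bayes classifier over character-trigram counts, written in plain Python. It is not the authors' implementation: the class name `TrigramNaiveBayes`, the Laplace smoothing constant `alpha`, and the tiny English training set are all hypothetical stand-ins for the 3000-comment Sinhala corpus, which is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def char_trigrams(text):
    """Return the overlapping character trigrams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


class TrigramNaiveBayes:
    """Multinomial Naive Bayes over character-trigram counts.

    Hypothetical illustration only; uses Laplace (add-alpha) smoothing.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        # Count documents per class and trigram occurrences per class.
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_trigrams(text))
        self.vocab = {g for c in self.feature_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for gram in char_trigrams(text):
                if gram in self.vocab:  # ignore trigrams never seen in training
                    score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder data (assumption): real use would need labelled Sinhala comments.
train_texts = [
    "you are awful and stupid",
    "awful stupid people",
    "have a nice day",
    "nice to see you today",
]
train_labels = ["offensive", "offensive", "neutral", "neutral"]

model = TrigramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("stupid awful"))
print(model.predict("nice day"))
```

Character n-grams need no word tokenizer or stemmer, which may be part of their appeal for a language with limited tooling; the abstract itself reports only the resulting recall and accuracy.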
Pages: 8