Machine Learning Techniques for the Detection of Inappropriate Erotic Content in Text

Cited by: 12
Authors
Molpeceres Barrientos, Gonzalo [1]
Alaiz-Rodriguez, Rocio [1]
Gonzalez-Castro, Victor [1]
Parnell, Andrew C. [2]
Affiliations
[1] Univ Leon, Dept Elect Syst & Automat Engn, Campus Vegazana S-N, Leon, Spain
[2] Maynooth Univ, Hamilton Inst, Maynooth, Kildare, Ireland
Funding
Science Foundation Ireland
Keywords
Inappropriate content; Machine learning; Text classification; Natural language processing; Text encoders; CLASSIFICATION; TWITTER
DOI
10.2991/ijcis.d.200519.003
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Nowadays, children have access to the Internet on a regular basis. Just like the real world, the Internet has many unsafe locations where kids may be exposed to inappropriate content in the form of obscene, aggressive, erotic or rude comments. In this work, we address the problem of detecting erotic/sexual content in text documents using Natural Language Processing (NLP) techniques. Following an approach based on Machine Learning techniques, we have assessed twelve models resulting from the combination of three text encoders (Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF) and Word2vec) with four classifiers (Support Vector Machines (SVMs), Logistic Regression, k-Nearest Neighbors and Random Forests). We evaluated these alternatives on a newly created dataset extracted from public data on the Reddit website. The best performance was achieved by the combination of the TF-IDF text encoder and the SVM classifier with a linear kernel, with an accuracy of 0.97 and an F-score of 0.96 (precision 0.96/recall 0.95). This study demonstrates that it is possible to detect erotic content in text documents and, therefore, to develop filters for minors or filters tailored to users' preferences. (C) 2020 The Authors. Published by Atlantis Press SARL.
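As an illustration of the TF-IDF encoding step named in the abstract, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function name `tfidf_encode`, the whitespace tokenisation, the smoothed IDF formula and the toy documents are all assumptions made for this example, and the classifier stage (for instance a linear-kernel SVM) is left out.

```python
import math
from collections import Counter


def tfidf_encode(docs):
    """Return (vocabulary, vectors) with L2-normalised TF-IDF weights.

    Hypothetical helper for illustration only; tokenisation is a plain
    lowercase whitespace split.
    """
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))

    # Smoothed inverse document frequency (an assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for toks in tokenised:
        tf = Counter(toks)  # raw term counts within this document
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec)) or 1.0
        vectors.append([w / norm for w in vec])
    return vocab, vectors


# Toy corpus (hypothetical, unrelated to the Reddit dataset).
docs = ["the cat sat", "the dog sat", "the cat ran home"]
vocab, vectors = tfidf_encode(docs)
```

In this sketch, a term that appears in every document (here "the") receives a lower weight than a rarer term in the same document, which is the property that makes TF-IDF vectors useful as classifier input.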
Pages: 591-603
Page count: 13
Related papers
54 in total
  • [1] Aggarwal C C., 2016, Recommender Systems, P139
  • [2] Deep Learning for Detecting Cyberbullying Across Multiple Social Media Platforms
    Agrawal, Sweta
    Awekar, Amit
    [J]. ADVANCES IN INFORMATION RETRIEVAL (ECIR 2018), 2018, 10772 : 141 - 153
  • [3] Cybercrime detection in online communications: The experimental case of cyberbullying detection in the Twitter network
    Al-garadi, Mohammed Ali
    Varathan, Kasturi Dewi
    Ravana, Sri Devi
    [J]. COMPUTERS IN HUMAN BEHAVIOR, 2016, 63 : 433 - 443
  • [4] [Anonymous], 1999, MODERN INFORM RETRIE
  • [5] [Anonymous], 2013, COMPUTER SCI
  • [6] [Anonymous], 2013, P INT C COMP LEARN R
  • [7] [Anonymous], 2010, SEARCH ENGINES INFOR
  • [8] Image denoising via an improved non-local total variation model
    Bai, Yunjiao
    Liu, Yi
    Zhang, Quan
    Jia, Lina
    Gui, Zhiguo
    [J]. JOURNAL OF ENGINEERING-JOE, 2018, (08): : 745 - 752
  • [9] Detecting Inappropriate Comments to News
    Bellan, Patrizio
    Strapparava, Carlo
    [J]. AI*IA 2018 - ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, 11298 : 403 - 414
  • [10] Bojanowski P., 2017, Trans. Assoc. Comput. Linguist., V5, P135, DOI 10.1162/tacl_a_00051