Fighting Adversarial Attacks on Online Abusive Language Moderation

Cited by: 2
Authors
Rodriguez, Nestor [1 ]
Rojas-Galeano, Sergio [1 ]
Affiliations
[1] Univ Dist Francisco Jose de Caldas, Sch Engn, Bogota, Colombia
Source
APPLIED COMPUTER SCIENCES IN ENGINEERING, WEA 2018, PT I | 2018 / Vol. 915
Keywords
Abusive language moderation; Adversarial attacks; Text pattern recognition;
DOI
10.1007/978-3-030-00350-0_40
CLC Number
TP39 [Applications of Computers];
Discipline Codes
081203; 0835;
Abstract
Lack of moderation in online conversations may result in personal aggression, harassment or cyberbullying. Such hostility is usually expressed through profanity or abusive language. On the basis of this assumption, Google recently developed a machine-learning model to detect hostility within a comment. The model assesses the extent to which abusive language is poisoning a conversation, producing a "toxicity" score for the comment. Unfortunately, it has been suggested that this toxicity model can be deceived by adversarial attacks that manipulate the text sequence of the abusive language. In this paper we aim to counter this anomaly: first we characterise two types of adversarial attacks, one using obfuscation and the other using polarity transformations. Then we propose a two-stage approach to disarm such attacks by coupling a text deobfuscation method with the toxicity scoring model. The approach was validated on a dataset of approximately 24,000 distorted comments, showing that it is feasible to restore the toxicity score of the adversarial variants. We anticipate that combining machine learning and text pattern recognition methods operating on different layers of linguistic features will help foster aggression-safe online conversations despite the adversarial challenges inherent in the versatile nature of written language.
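As a rough illustration of the character-level obfuscation attacks the abstract mentions, the sketch below undoes two common tricks (homoglyph substitution such as `1d!0t`, and separators inserted inside a word such as `i.d.i.o.t`) before a comment would be passed to a toxicity scorer. The substitution table, function name and rules here are hypothetical examples, not the deobfuscation method proposed in the paper; they only convey the general idea of restoring the surface form that the scorer was trained on.

```python
import re

# Hypothetical homoglyph table: symbols/digits attackers swap in for letters.
LEET_MAP = str.maketrans({"@": "a", "$": "s", "0": "o", "1": "i", "3": "e", "!": "i"})

def deobfuscate(comment: str) -> str:
    """Undo simple character-level obfuscation before toxicity scoring."""
    text = comment.lower().translate(LEET_MAP)            # map homoglyphs back to letters
    # Strip separator characters injected between letters of a word (e.g. "i.d.i.o.t").
    text = re.sub(r"([a-z])[*._\-]+(?=[a-z])", r"\1", text)
    return text

print(deobfuscate("you 1d!0t"))    # -> "you idiot"
print(deobfuscate("i.d.i.o.t"))   # -> "idiot"
```

A real deobfuscation stage would need a richer substitution model and a dictionary check, since greedy rules like these can also distort legitimate tokens (URLs, usernames), but the two-stage idea is the same: normalise the text first, then score it.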
Pages: 480-493
Number of pages: 14