Text Backdoor Detection Using an Interpretable RNN Abstract Model

Cited by: 24
Authors
Fan, Ming [1 ]
Si, Ziliang [1 ]
Xie, Xiaofei [2 ]
Liu, Yang [2 ]
Liu, Ting [1 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Cyber Sci & Engn, MoEKLINNS Lab, Xian 710049, Peoples R China
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
Funding
National Natural Science Foundation of China; National Research Foundation of Singapore; China Postdoctoral Science Foundation;
Keywords
Training; Recurrent neural networks; Task analysis; Motion pictures; Data models; Analytical models; Sentiment analysis; Text backdoor detection; RNN; model abstraction; interpretation;
DOI
10.1109/TIFS.2021.3103064
CLC Classification Number
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
Deep neural networks (DNNs) are known to be inherently vulnerable to malicious attacks such as the adversarial attack and the backdoor attack. The former is crafted by adding small perturbations to benign inputs so as to fool a DNN. The latter generally embeds a hidden pattern in a DNN by poisoning the dataset during the training process, causing the infected model to misbehave on predefined inputs that carry a specific trigger while performing normally on all others. Much work has been conducted on defending against adversarial samples, whereas the backdoor attack has received much less attention, especially in recurrent neural networks (RNNs), which play an important role in the text processing field. Two main limitations make it hard to directly apply existing image backdoor detection approaches to RNN-based text classification systems. First, a layer in an RNN does not preserve the same feature latent space function for different inputs, making it impossible to map the inserted trigger pattern to the neural activations. Second, text data is inherently discrete, which makes it hard to optimize the text the way image pixels are optimized. In this work, we propose a novel backdoor detection approach named InterRNN for RNN-based text classification systems from the interpretation perspective. Specifically, we first propose a novel RNN interpretation technique that constructs a nondeterministic finite automaton (NFA) based abstract model, which effectively reduces the analysis complexity of an RNN while preserving its original logic rules. Based on the abstract model, we then obtain interpretation results that explain the fundamental reason behind the decision for each input. Finally, we detect trigger words by leveraging the differences between the model's behaviors on backdoor sentences and those on normal sentences. Extensive experimental results on four benchmark datasets demonstrate that our approach generates better interpretation results than state-of-the-art approaches and effectively detects backdoors in RNNs.
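To make the state-abstraction idea in the abstract concrete, the following is a minimal, hypothetical Python sketch: it clusters per-token RNN hidden states into a small set of abstract states, builds an NFA-like transition map from clean data, and flags tokens that cause unusually rare transitions as candidate trigger words. The function names, the k-means clustering step, and the anomaly score are illustrative assumptions for exposition only, not the authors' InterRNN implementation.

```python
# Hypothetical sketch of NFA-style state abstraction for an RNN text classifier.
# hidden_state_seqs: a list of (num_tokens, hidden_dim) arrays, one per sentence,
# extracted from the RNN on a clean reference corpus.
import numpy as np
from sklearn.cluster import KMeans

def abstract_state_sequences(hidden_state_seqs, n_states=20, seed=0):
    """Cluster per-token hidden vectors into abstract states and return,
    for each sentence, its sequence of abstract-state ids."""
    km = KMeans(n_clusters=n_states, random_state=seed, n_init=10)
    km.fit(np.vstack(hidden_state_seqs))          # pool all token-level hidden vectors
    return km, [km.predict(seq) for seq in hidden_state_seqs]

def build_transition_matrix(state_seqs, n_states):
    """Count abstract-state transitions over the reference corpus, giving an
    NFA-like map of the model's normal behavior."""
    counts = np.zeros((n_states, n_states))
    for seq in state_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Row-normalize; rows with no observed outgoing transitions stay zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def token_anomaly_scores(state_seq, transitions):
    """Score each token by how unlikely the abstract-state transition it causes is;
    tokens with abnormally high scores are candidate trigger words."""
    scores = [0.0]                                # first token has no incoming transition
    for a, b in zip(state_seq[:-1], state_seq[1:]):
        scores.append(1.0 - transitions[a, b])
    return scores
```

In this sketch, a clean held-out set would supply the hidden states used to fit the abstract model, and sentences whose tokens produce outlying anomaly scores would be inspected for backdoor triggers; the actual paper derives its interpretation and detection from the NFA-based abstract model rather than from this simple transition-probability heuristic.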
Pages: 4117 - 4132
Page count: 16