Detection of Backdoors in Trained Classifiers Without Access to the Training Set

Cited by: 31
Authors
Xiang, Zhen [1 ]
Miller, David J. [1 ,2 ]
Kesidis, George [1 ,2 ]
Affiliations
[1] Penn State Univ, Sch Elect Engn & Comp Sci, University Pk, PA 16803 USA
[2] Anomalee Inc, State Coll, PA 16803 USA
Keywords
Anomaly detection (AD); backdoor; data poisoning (DP); order statistics; reverse engineering (RE); robust density estimation;
DOI
10.1109/TNNLS.2020.3041202
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
With wide deployment of deep neural network (DNN) classifiers, there is great potential for harm from adversarial learning attacks. Recently, a special type of data poisoning (DP) attack, known as a backdoor (or Trojan), was proposed. These attacks do not seek to degrade classification accuracy, but rather to have the classifier learn to classify to a target class t* whenever the backdoor pattern is present in a test example originally from a source class s*. Launching backdoor attacks does not require knowledge of the classifier or its training process; it requires only the ability to poison the training set with exemplars containing a backdoor pattern (labeled with the target class). Defenses against backdoors can be deployed before/during training, post-training, or at test time. Here, we address post-training detection in DNN image classifiers, seldom considered in existing works, wherein the defender does not have access to the poisoned training set, but only to the trained classifier itself, as well as to clean (unpoisoned) examples from the classification domain. This scenario is of great interest because, e.g., a classifier may be the basis of a phone app that will be shared with many users; detection may thus reveal a widespread attack. We propose a purely unsupervised anomaly detection (AD) defense against imperceptible backdoor attacks that: 1) detects whether the trained DNN has been backdoor-attacked; 2) infers the source and target classes in a detected attack; and 3) estimates the backdoor pattern itself. Our AD approach involves learning (via suitable cost function minimization) the minimum size/norm perturbation (putative backdoor) required to induce the classifier to misclassify (most) examples from class s to class t, for all (s, t) pairs. Our hypothesis is that non-attacked pairs require large perturbations, while the attacked pair (s*, t*) requires a much smaller one. This is convincingly borne out experimentally. We identify a variety of plausible cost functions and devise a novel, robust hypothesis testing approach to perform detection inference. We test our approach, in comparison with state-of-the-art methods, for several backdoor patterns, attack settings and mechanisms, and data sets, and demonstrate its favorability. Our defense essentially requires setting a single hyperparameter (the detection threshold), which can, e.g., be chosen to fix the system's false positive rate.
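To make the approach concrete, the sketch below illustrates in PyTorch the perturbation reverse-engineering and anomaly-scoring idea summarized in the abstract. The specific cost function (cross-entropy toward the target class plus an L2 size penalty), the single additive perturbation shared across a batch, and the median-based flagging rule are illustrative assumptions, not the paper's exact formulation; the paper evaluates several cost functions and performs detection with a robust hypothesis test on order statistics of the estimated perturbation sizes.

```python
# Illustrative sketch (not the paper's exact algorithm): for every ordered class
# pair (s, t), estimate a small additive perturbation that induces clean class-s
# images to be classified as class t, then flag pairs whose required perturbation
# is anomalously small compared with the rest.
import torch
import torch.nn.functional as F


def estimate_perturbation(model, x_s, target_t, steps=300, lr=0.05, lam=1e-3):
    """Learn a small perturbation (putative backdoor) that pushes the clean
    class-s batch x_s toward class target_t; return it and its L2 norm."""
    delta = torch.zeros_like(x_s[:1], requires_grad=True)  # one pattern shared by the batch
    opt = torch.optim.Adam([delta], lr=lr)
    t = torch.full((x_s.shape[0],), target_t, dtype=torch.long)
    for _ in range(steps):
        logits = model((x_s + delta).clamp(0.0, 1.0))
        # Misclassification cost toward t plus a penalty favoring small perturbations.
        loss = F.cross_entropy(logits, t) + lam * delta.norm()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach(), delta.detach().norm().item()


def detect_backdoor(model, clean_by_class, rel_threshold=0.25):
    """Estimate the perturbation norm for every (s, t) pair and report pairs
    needing a much smaller perturbation than the median as putative attacks."""
    norms = {}
    for s in clean_by_class:
        for t in clean_by_class:
            if s != t:
                _, n = estimate_perturbation(model, clean_by_class[s], t)
                norms[(s, t)] = n
    median = torch.tensor(list(norms.values())).median().item()
    suspects = {pair: n for pair, n in norms.items() if n < rel_threshold * median}
    return suspects, norms
```

In the paper's setting, detection ultimately reduces to a single threshold on the resulting statistic, which can be set to fix the system's false positive rate; the simple median comparison above stands in for that hypothesis test.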
Pages: 1177-1191
Page count: 15